Introduction


Since the invention of the personal computer, software developers have used x86 assembly language 
to create innovative solutions for a wide variety of algorithmic challenges. During the early days of the 
PC era, it was common practice to code large portions of a program or complete applications using x86 
assembly language. Given the 21st Century prevalence of high-level languages such as C++, C#, Java, and
Python, it may be surprising to learn that many software developers still employ assembly language to code 
performance-critical sections of their programs. And while compilers have improved remarkably over the 
years in terms of generating machine code that is both spatially and temporally efficient, situations still exist 
where it makes sense for a software developer to exploit the benefits of assembly language programming. 
The single-instruction multiple-data (SIMD) architectures of modern x86 processors provide another 
explanation for the continued interest in assembly language programming. A SIMD-capable processor 
contains computational resources that facilitate simultaneous calculations using multiple data values, which 
can significantly improve the performance of applications that must deliver real-time responsiveness. SIMD 
architectures are also well-suited for computationally-intense problem domains, such as image processing, 
audio and video encoding, computer-aided design, computer graphics, and data mining. Unfortunately, 
many high-level languages and development tools are still unable to fully or even partially exploit the 
SIMD capabilities of a modern x86 processor. Assembly language, on the other hand, enables the software 
developer to take full advantage of a processor's SIMD resources. 


Modern X86 Assembly Language Programming 


Modern X86 Assembly Language Programming, Second Edition is an edifying text about x86 64-bit (x86-64) 
assembly language programming. The book's content and organization are designed to help you quickly 
understand x86-64 assembly language programming and the computational resources of Advanced Vector 
Extensions (AVX). It also contains an abundance of source code that is structured to accelerate learning and 
comprehension of essential x86-64 assembly language constructs and SIMD programming concepts. After 
reading and using this book, you'll be able to code performance-enhancing functions and algorithms using 
x86-64 assembly language and the AVX, AVX2, and AVX-512 instruction sets. 

Before proceeding I should explicitly mention that this book does not cover x86-32 assembly language 
programming. It also doesn't discuss legacy x86 technologies such as the x87 floating-point unit, MMX, 
and Streaming SIMD Extensions. The first edition remains relevant if you're interested in learning about 
these topics. This book does not explain x86 architectural features or privileged instructions that are used 
in operating systems. However, you will need to thoroughly understand the material that's presented in this 
book to develop x86 assembly language code for use in an operating system. 

While it is still theoretically possible to write an entire application program using assembly language, 
the demanding requirements of contemporary software development make such an approach impractical 
and ill-advised. Instead, this book concentrates on coding x86-64 assembly language functions that are
callable from C++. Each source code example was created using Microsoft Visual Studio C++ and Microsoft 
Macro Assembler (MASM). 


Target Audience 


The target audience for this book is software developers, including: 


• Software developers who are creating application programs for Windows-based
platforms and want to learn how to write performance-enhancing algorithms and
functions using x86-64 assembly language


• Software developers who are creating application programs for non-Windows
environments and want to learn x86-64 assembly language programming


• Software developers who want to learn how to create SIMD calculating functions
using the AVX, AVX2, and AVX-512 instruction sets


• Software developers and computer science students who want or need to gain a
better understanding of the x86-64 platform and its SIMD architecture


The principal audience for Modern X86 Assembly Language Programming, Second Edition is Windows 
software developers, since the source code examples were developed using Visual Studio C++ and MASM. 
Software developers who are targeting non-Windows platforms can also benefit from this book since most 
of the informative content is organized and communicated independent of any specific operating system. It 
is assumed that readers of this book will have previous high-level language programming experience and a 
basic understanding of C++. Familiarity with Visual Studio or Windows programming is not necessary. 


Content Overview 


The primary objective of this book is to help you learn x86 64-bit assembly language programming along 
with AVX, AVX2, and AVX-512. The book's chapters and content are structured to achieve this goal. Here's a 
brief overview of what you can expect to learn. 

Chapter 1 covers the core architecture of the x86-64 platform. It includes a discussion of the 
platform's fundamental data types, internal architecture, register sets, instruction operands, and memory 
addressing modes. This chapter also describes the core x86-64 instruction set. Chapters 2 and 3 explain 
the fundamentals of x86-64 assembly language programming using the core instruction set and common 
programming constructs, including arrays and structures. The source code examples presented in these 
(and subsequent) chapters are packaged as working programs, which means that you can run, modify, or 
otherwise experiment with the code to enhance your learning experience. 

Chapter 4 focuses on the architectural resources of AVX including its register sets, data types, and 
instruction set. Chapter 5 explains how to use the AVX instruction set to perform scalar floating-point 
arithmetic using both single-precision and double-precision values. Chapters 6 and 7 illustrate AVX SIMD 
programming using packed floating-point and packed integer operands. 

Chapter 8 introduces AVX2 and explores its enhanced capabilities including data broadcasts, gathers, 
and permutes. It also explains fused-multiply-add (FMA) operations. Chapters 9 and 10 contain source code 
examples that exemplify a variety of computational algorithms using AVX2 with packed floating-point and 
packed integer operands. Chapter 11 includes source code examples that demonstrate FMA programming. 
This chapter also covers examples that explicate recent x86 platform extensions using the general-purpose 
registers. 

Chapter 12 delves into the architectural details of AVX-512. This chapter describes AVX-512's register 
sets and data types. It also elucidates pivotal AVX-512 enhancements including conditional execution and 
merging, embedded broadcast operations, and instruction-level rounding. Chapters 13 and 14 contain 
numerous source code examples that demonstrate how to exploit these advanced features.

Chapter 15 presents an overview of a modern x86 multi-core processor and its underlying
microarchitecture. This chapter also outlines specific coding strategies and techniques that can be used to 
boost the performance of x86 assembly language code. Chapter 16 reviews several source code examples 
that illustrate advanced x86 assembly language programming techniques including processor feature 
detection, accelerated memory accesses, and multithreaded computations. 

Appendix A describes how to execute the source code examples using Visual Studio and MASM. It also 
includes a list of references and resources that you can consult for more information about x86 assembly 
language programming. 


Source Code 


Source code download information for this book is available on the Apress website at
https://www.apress.com/us/book/9781484240625. For each chapter, there is a ZIP file that contains the C++ and assembly
language source code files along with the Visual Studio project files. There is no setup or install program to 
run. You can simply extract the contents of a chapter ZIP file into a folder of your own choosing. 


Caution The sole purpose of the source code is to elucidate programming examples that are directly 
related to the topics discussed in this book. Minimal attention is given to essential software engineering 
concerns such as robust error handling, security risks, numerical stability, rounding errors, or ill-conditioned 
functions. You are responsible for addressing these issues should you decide to use any of the source code in 
your own programs. 


The source code examples were created using Visual Studio Professional 2017 (version 15.7.1) on a
PC running Windows 10 Pro 64-bit. The Visual Studio website (https://visualstudio.microsoft.com)
contains more information about this and the other editions of Visual Studio. Technical details regarding
Visual Studio installation, configuration, and application program development are available at
https://docs.microsoft.com/en-us/visualstudio/?view=vs-2017.

The recommended hardware platform for running the source code examples is an x86-based PC 
with Windows 10 64-bit and a processor that supports AVX. An AVX2 or AVX-512 compatible processor 
is required to run the source code examples that employ these instruction sets. You can use one of the freely
available utilities listed in Appendix A to determine which x86-AVX instruction set extensions your PC 
supports. 


Additional Resources 


An extensive set of x86-related programming documentation is available from both AMD and Intel. 
Appendix A lists several important resources that both aspiring and experienced x86 assembly language 
programmers will find useful. Of all the resources listed in Appendix A, the most valuable reference is Volume
2 of Intel 64 and IA-32 Architectures Software Developer’s Manual - Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A,
3B, 3C, 3D, and 4 (https://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html).
This tome contains comprehensive programming information for every x86
processor instruction including detailed operational descriptions, lists of valid operands, affected status 
flags, and potential exceptions. You are strongly encouraged to consult this indispensable resource when 
developing your own x86 assembly language code to verify correct instruction usage. 
CHAPTER 1


X86-64 Core Architecture 


Chapter 1 examines the x86-64's core architecture from the perspective of an application program. It opens 
with a brief historical overview of the x86 platform in order to provide a frame of reference for subsequent 
content. This is followed by a review of fundamental, numeric, and SIMD data types. X86-64 core 
architecture is examined next, which includes explanations of processor register sets, status flags, instruction 
operands, and memory addressing modes. The chapter concludes with an overview of the core x86-64 
instruction set. 

Unlike high-level languages such as C and C++, assembly language programming requires the software 
developer to comprehend specific architectural features of the target processor before attempting to write 
any code. The topics discussed in this chapter will fulfill this requirement and provide a foundation for 
understanding the sample code that's presented later in this book. This chapter also provides the base 
material that's necessary to understand the x86-64's SIMD enhancements. 


Historical Overview 


Before examining the technical details of the x86-64's core architecture, it can be beneficial to understand 
how the architecture has evolved over the years. The short review that follows focuses on noteworthy 
processors and instruction set enhancements that have affected how software developers use x86 assembly 
language. Readers who are interested in a more comprehensive chronicle of the x86's lineage should consult 
the resources listed in Appendix A. 

The x86-64 processor platform is an extension of the original x86-32 platform. The first silicon 
embodiment of the x86-32 platform was the Intel 80386 microprocessor, which was introduced in 1985.
The 80386 extended the architecture of the 16-bit 80286 to include 32-bit wide registers and data types,
flat memory model options, a 4 GB logical address space, and paged virtual memory. The 80486 processor 
improved the performance of the 80386 with the inclusion of on-chip memory caches and optimized 
instructions. Unlike the 80386 with its separate 80387 floating-point unit (FPU), most versions of the 80486 
CPU also included an integrated x87 FPU. 

Expansion of the x86-32 platform continued with the introduction of the first Pentium brand processor 
in 1993. Known as the P5 microarchitecture, performance enhancements included a dual-instruction 
execution pipeline, 64-bit external data bus, and separate on-chip memory caches for both code and 
data. Later versions (1997) of the P5 microarchitecture incorporated a new computational resource called 
MMX technology, which supports single-instruction multiple-data (SIMD) operations on packed integers 
using 64-bit wide registers. A packed integer is a collection of multiple integer values that are processed 
simultaneously. 

The P6 microarchitecture, first used on the Pentium Pro (1995) and later on the Pentium II (1997), 
extended the x86-32 platform using a three-way superscalar design. This means that the processor is
able (on average) to decode, dispatch, and execute three distinct instructions during each clock cycle. Other 
P6 augmentations included out-of-order instruction executions, improved branch prediction algorithms, 
and speculative executions. The Pentium III, also based on the P6 microarchitecture, was launched in 1999
and included a new SIMD technology called Streaming SIMD Extensions (SSE). SSE adds eight 128-bit
wide registers to the x86-32 platform and instructions that perform packed single-precision floating-point 
arithmetic. 

In 2000 Intel introduced a new microarchitecture called Netburst that included SSE2, which extended 
the floating-point capabilities of SSE to cover packed double-precision values. SSE2 also incorporated 
additional instructions that enabled the 128-bit SSE registers to be used for packed integer calculations and 
scalar floating-point operations. Processors based on the Netburst architecture included several variations 
of the Pentium 4. In 2004 the Netburst microarchitecture was upgraded to include SSE3 and hyper-threading
technology. SSE3 adds new packed integer and packed floating-point instructions to the x86 platform, while 
hyper-threading technology parallelizes the processor's front-end instruction pipelines in order to improve 
performance. SSE3 capable processors include 90 nm (and smaller) versions of the Pentium 4 and Xeon 
product lines. 

In 2006 Intel launched a new microarchitecture called Core. The Core microarchitecture included 
redesigns of many Netburst front-end pipelines and execution units in order to improve performance and 
reduce power consumption. It also incorporated a number of SIMD enhancements including SSSE3 and 
SSE4.1. These extensions added new packed integer and packed floating-point instructions to the platform 
but no new registers or data types. Processors based on the Core microarchitecture include CPUs from the 
Core 2 Duo and Core 2 Quad series and Xeon 3000/5000 series. 

A microarchitecture called Nehalem followed Core in late 2008. This microarchitecture re-introduced 
hyper-threading to the x86 platform, which had been excluded from the Core microarchitecture. The 
Nehalem microarchitecture also incorporates SSE4.2. This final x86-SSE enhancement adds several
application-specific accelerator instructions to the x86-SSE instruction set. SSE4.2 also includes new 
instructions that facilitate text string processing using the 128-bit wide x86-SSE registers. Processors based 
on the Nehalem microarchitecture include first generation Core i3, i5, and i7 CPUs. It also includes CPUs 
from the Xeon 3000, 5000, and 7000 series. 

In 2011 Intel launched a new microarchitecture called Sandy Bridge. The Sandy Bridge 
microarchitecture introduced a new x86 SIMD technology called Advanced Vector Extensions (AVX).
AVX adds packed floating-point operations (both single-precision and double-precision) using 256-bit wide
registers. AVX also supports a new three-operand instruction syntax, which improves code efficiency by 
reducing the number of register-to-register data transfers that a software function must perform. Processors 
based on the Sandy Bridge microarchitecture include second and third generation Core i3, i5, and i7 CPUs 
along with Xeon V2 series CPUs. 

In 2013 Intel unveiled its Haswell microarchitecture. Haswell includes AVX2, which extends AVX to 
support packed-integer operations using 256-bit wide registers. AVX2 also supports enhanced data transfer 
capabilities with its broadcast, gather, and permute instructions. (Broadcast instructions replicate a single 
value to multiple locations; data gather instructions load multiple elements from non-contiguous memory 
locations; permute instructions rearrange the elements of a packed operand.) Another feature of the Haswell 
microarchitecture is its inclusion of fused-multiply-add (FMA) operations. FMA enables software algorithms 
to perform product-sum (or dot product) calculations using a single floating-point rounding operation, 
which can improve both performance and accuracy. The Haswell microarchitecture also encompasses 
several new general-purpose register instructions. Processors based on the Haswell microarchitecture 
include fourth generation Core i3, i5, and i7 CPUs. AVX2 is also included in later generations of Core family
CPUs, and in Xeon V3, V4, and V5 series CPUs. 
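
The single-rounding behavior of FMA can be observed from C++ without writing any assembly language. The following is a minimal sketch rather than code from this book's examples: the function names MulAddFused and FusedResidual are hypothetical, and the operand values are assumptions chosen so that the exact product falls halfway between two adjacent double-precision values. It relies on std::fma from <cmath>, which computes a * b + c with one rounding whether or not the processor supports FMA instructions.

```cpp
#include <cassert>
#include <cmath>

// Fused evaluation of a * b + c: the product is not rounded before the add.
inline double MulAddFused(double a, double b, double c)
{
    return std::fma(a, b, c);
}

// Uses a = 1 + 2^-27 and b = 1 - 2^-27. The exact product is 1 - 2^-54,
// which is not representable as a double. Adding c = -1 exposes whatever
// was lost if the product had been rounded first.
inline double FusedResidual()
{
    const double eps = std::ldexp(1.0, -27);
    const double a = 1.0 + eps;
    const double b = 1.0 - eps;

    // Both operands are exact, so any error comes from the multiply-add itself.
    assert(a - 1.0 == eps && 1.0 - b == eps);

    return MulAddFused(a, b, -1.0);     // exact answer is -2^-54
}
```

FusedResidual() returns -2^-54, which is the exact result. Evaluating a * b - 1.0 as two separately rounded operations would normally produce 0.0, because the product rounds to 1.0. Some compilers may fuse that expression on their own, so the outcome of the unfused form depends on the compiler settings.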

X86 platform extensions over the past several years have not been limited to SIMD enhancements. In 
2003 AMD introduced its Opteron processor, which extended the x86's execution platform from 32 bits to 
64 bits. Intel followed suit in 2004 by adding essentially the same 64-bit extensions to its processors starting 
with certain versions of the Pentium 4. All Intel processors based on the Core, Nehalem, Sandy Bridge, 
Haswell, and Skylake microarchitectures support the x86-64 execution environment. 

Processors from AMD have also evolved over the past few years. In 2003 AMD introduced a series of 
processors based on its K8 microarchitecture. Original versions of the K8 included support for MMX, SSE, 
and SSE2 while later versions added SSE3. In 2007 the K10 microarchitecture was launched and included a 
SIMD enhancement called SSE4a. SSE4a contains several mask shift and streaming store instructions that
are not available on processors from Intel. Following the K10, AMD introduced a new microarchitecture 
called Bulldozer in 2011. The Bulldozer microarchitecture includes SSSE3, SSE4.1, SSE4.2, SSE4a, and 
AVX. It also includes FMA4, which is a four-operand version of fused-multiply-add. Like SSE4a, processors 
marketed by Intel do not support FMA4 instructions. A 2012 update to the Bulldozer microarchitecture
called Piledriver includes support for both FMA4 and the three-operand version of FMA, which is called
FMA3 by some CPU feature-detection utilities and third-party documentation sources. The most recent
AMD microarchitecture, introduced during 2017, is called Zen. This microarchitecture includes the AVX2 
instruction set enhancements and is used in the Ryzen series of processors. 

High-end desktop and server-oriented processors based on Intel's Skylake-X microarchitecture, also 
first marketed during 2017, include a new SIMD extension called AVX-512. This architectural enhancement 
supports packed integer and floating-point operations using 512-bit wide registers. AVX-512 also includes 
architectural additions that facilitate instruction-level conditional data merging, floating-point rounding 
control, and broadcast operations. Over the next few years, it is expected that both AMD and Intel will 
incorporate AVX-512 into their mainstream processors for desktop and notebook PCs. 


Data Types 


Programs written using x86 assembly language can use a wide variety of data types. Most program data 
types originate from a small set of fundamental data types that are intrinsic to the x86 platform. These 
fundamental data types enable the processor to perform numerical and logical operations using signed 
and unsigned integers, single-precision (32-bit) and double-precision (64-bit) floating-point values, text 
strings, and SIMD values. In this section, you'll learn about the fundamental data types along with a few 
miscellaneous data types supported by the x86. 


Fundamental Data Types 


A fundamental data type is an elementary unit of data that is manipulated by the processor during program 
execution. The x86 platform supports fundamental data types ranging in size from 8 bits (1 byte) to 128 bits 
(16 bytes). Table 1-1 shows these types along with typical use patterns. 


Table 1-1. Fundamental Data Types 


Data Type          Size (Bits)   Typical Use
Byte               8             Characters, small integers
Word               16            Characters, integers
Doubleword         32            Integers, single-precision floating-point
Quadword           64            Integers, double-precision floating-point
Double Quadword    128           Packed integers, packed floating-point


Unsurprisingly, the fundamental data types are sized using integer powers of two. The bits of a
fundamental data type are numbered from right to left with zero and size - 1 used to identify the least and
most significant bits, respectively. Fundamental data types larger than a single byte are stored in consecutive
memory locations starting with the least-significant byte at the lowest memory address. This type of
in-memory byte ordering is called little endian. Figure 1-1 illustrates the bit numbering and byte ordering
schemes that are used by the fundamental data types. 
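
The byte ordering described above can be checked from C++ by copying a doubleword into a byte array. This is only an illustrative sketch; the function names (GetBytes, IsLittleEndian, IsBitSet) and the sample value 0x11223344 are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copies the in-memory bytes of a doubleword into 'bytes', lowest address first.
inline void GetBytes(std::uint32_t value, std::uint8_t bytes[4])
{
    std::memcpy(bytes, &value, sizeof(value));
}

// Returns true if the least-significant byte is stored at the lowest address.
inline bool IsLittleEndian()
{
    std::uint8_t bytes[4];
    GetBytes(0x11223344u, bytes);
    return bytes[0] == 0x44 && bytes[1] == 0x33 &&
           bytes[2] == 0x22 && bytes[3] == 0x11;
}

// Tests bit 'pos' of a doubleword; bit 0 is the least-significant bit and
// bit 31 (size - 1) is the most-significant bit.
inline bool IsBitSet(std::uint32_t value, unsigned int pos)
{
    assert(pos < 32);
    return ((value >> pos) & 1u) != 0;
}
```

On an x86 processor IsLittleEndian() returns true, since the byte 0x44 occupies the lowest memory address.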


[Figure: bit positions and memory addresses of the fundamental data types]

Figure 1-1. Bit-numbering and byte-ordering for fundamental data types


A properly-aligned fundamental data type is one whose address is evenly divisible by its size in bytes. 
For example, a doubleword is properly aligned when it's stored at a memory location with an address that 
is evenly divisible by four. Similarly, quadwords are properly aligned at addresses evenly divisible by eight. 
Unless specifically enabled by the operating system, an x86 processor does not require proper alignment 
of multi-byte fundamental data types in memory. However, it is standard practice to properly align all 
multi-byte values whenever possible in order to avoid potential performance penalties that can occur if the 
processor is required to access misaligned data in memory. 
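
The alignment rule reduces to a remainder test on the address. The sketch below shows one way to express it in C++; the IsAligned function and the Buffer structure are hypothetical names introduced for this illustration, and the code assumes (as is the case on x86-64) that converting a pointer to std::uintptr_t yields its numeric address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if the address in 'p' is evenly divisible by 'size_in_bytes'.
inline bool IsAligned(const void* p, std::size_t size_in_bytes)
{
    assert(size_in_bytes != 0);
    return (reinterpret_cast<std::uintptr_t>(p) % size_in_bytes) == 0;
}

// A buffer that starts on a 16-byte boundary, so the alignment of each
// element depends only on its offset within Data.
struct alignas(16) Buffer
{
    std::uint8_t Data[32];
};
```

Given a Buffer object named buf, IsAligned(&buf.Data[8], 8) is true (a properly aligned quadword address), while IsAligned(&buf.Data[2], 4) is false (a misaligned doubleword address).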


Numerical Data Types 


A numerical data type is an elementary scalar value such as an integer or floating-point number. All 
numerical data types recognized by the CPU are represented using one of the fundamental data types 
discussed in the previous section. Table 1-2 contains a list of x86 numerical data types along with 
corresponding C/C++ types. This table also includes the fixed-size types that are defined in the C++
header file <cstdint> (see http://www.cplusplus.com/reference/cstdint/ for more information
about this header file). The x86-64 instruction set intrinsically supports arithmetic and logical operations
using 8-, 16-, 32-, and 64-bit integers, both signed and unsigned. It also supports arithmetic calculations and 
data manipulation operations using single-precision and double-precision floating-point values. 
Table 1-2. X86 Numerical Data Types


Type                Size (Bits)   C/C++ Type                    <cstdint>
Signed integers     8             char                          int8_t
                    16            short                         int16_t
                    32            int, long                     int32_t
                    64            long long                     int64_t
Unsigned integers   8             unsigned char                 uint8_t
                    16            unsigned short                uint16_t
                    32            unsigned int, unsigned long   uint32_t
                    64            unsigned long long            uint64_t
Floating-point      32            float                         Not applicable
                    64            double                        Not applicable
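
The sizes listed in Table 1-2 can be verified at compile time. The following sketch is not part of this book's sample code, and SizeInBits is a hypothetical helper. Note that the table's 32-bit entry for long reflects Visual C++; other x86-64 compilers may define long as a 64-bit type, so the sketch checks only the types whose sizes agree across common x86-64 compilers.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Size of type T in bits (CHAR_BIT is 8 on x86-64).
template <typename T>
constexpr std::size_t SizeInBits()
{
    return sizeof(T) * CHAR_BIT;
}

// Fixed-size integer types from <cstdint>
static_assert(SizeInBits<std::int8_t>() == 8, "int8_t is a byte");
static_assert(SizeInBits<std::int16_t>() == 16, "int16_t is a word");
static_assert(SizeInBits<std::int32_t>() == 32, "int32_t is a doubleword");
static_assert(SizeInBits<std::int64_t>() == 64, "int64_t is a quadword");

// Built-in types (long is omitted on purpose; its size is compiler-specific)
static_assert(SizeInBits<short>() == 16, "short is a word");
static_assert(SizeInBits<int>() == 32, "int is a doubleword");
static_assert(SizeInBits<long long>() == 64, "long long is a quadword");
static_assert(SizeInBits<float>() == 32, "float is single precision");
static_assert(SizeInBits<double>() == 64, "double is double precision");
```

Using the <cstdint> types in function declarations that are shared between C++ and assembly language code makes the operand sizes explicit, which is one reason to prefer them over the built-in names.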


SIMD Data Types 


A SIMD data type is a contiguous collection of bytes that's used by the processor to perform an operation
or calculation using multiple values. A SIMD data type can be regarded as a container object that holds
several instances of the same fundamental data type (e.g., bytes, words, doublewords, or quadwords). Like
fundamental data types, the bits of a SIMD data type are numbered from right to left with zero and size - 1 
denoting the least and most significant bits, respectively. Little-endian ordering is also used when SIMD 
values are stored in memory, as illustrated in Figure 1-2. 
Figure 1-2. SIMD data types


Programmers can use SIMD (or packed) data types to perform simultaneous calculations using either 
integers or floating-point values. For example, a 128-bit wide packed data type can be used to hold sixteen 
8-bit integers, eight 16-bit integers, four 32-bit integers, or two 64-bit integers. A 256-bit wide packed data 
type can hold a variety of data elements including eight single-precision floating-point values or four 
double-precision floating-point values. Table 1-3 contains a complete list of the SIMD data types and the 
maximum number of elements for various numerical data types. 
Table 1-3. SIMD Data Types and Maximum Number of Data Elements


Numerical Type                     xmmword   ymmword   zmmword
8-bit integer                      16        32        64
16-bit integer                     8         16        32
32-bit integer                     4         8         16
64-bit integer                     2         4         8
Single-precision floating-point    4         8         16
Double-precision floating-point    2         4         8
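
Each entry in Table 1-3 is simply the width of the packed operand divided by the width of one element. The sketch below expresses that calculation in C++; the constant names and the MaxElements helper are hypothetical and are not part of any library or of this book's sample code.

```cpp
#include <cstddef>
#include <cstdint>

// Widths of the three packed operand sizes, in bytes.
constexpr std::size_t XmmWordBytes = 16;    // 128 bits
constexpr std::size_t YmmWordBytes = 32;    // 256 bits
constexpr std::size_t ZmmWordBytes = 64;    // 512 bits

// Maximum number of elements of type T that fit in a packed operand.
template <typename T>
constexpr std::size_t MaxElements(std::size_t simd_bytes)
{
    return simd_bytes / sizeof(T);
}

// A few entries from Table 1-3
static_assert(MaxElements<std::int8_t>(XmmWordBytes) == 16, "8-bit integers per xmmword");
static_assert(MaxElements<float>(YmmWordBytes) == 8, "single-precision values per ymmword");
static_assert(MaxElements<double>(ZmmWordBytes) == 8, "double-precision values per zmmword");
```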


As discussed earlier in this chapter, SIMD enhancements have been regularly added to the x86 platform 
starting in 1997 with MMX technology and most recently with the addition of AVX-512. This presents some 
challenges to the software developer who wants to exploit these technologies in that the packed data types 
described in Table 1-3 and their associated instruction sets are not universally supported by all processors. 
Fortunately, methods are available to determine at runtime the specific SIMD features and instruction sets 
that a processor supports. You'll learn how to use some of these methods in Chapter 16. 


Miscellaneous Data Types 


The x86 platform also supports a number of miscellaneous data types including strings, bit fields, and bit 
strings. An x86 string is a contiguous block of bytes, words, doublewords, or quadwords. X86 strings are used
to support text-based data types and processing operations. For example, the C/C++ data types char and
wchar_t are usually implemented using an x86 byte or word, respectively. X86 strings can also be employed
to perform processing operations on arrays, bitmaps, and similar contiguous-block data structures. The 
x86 instruction set includes instructions that can carry out compare, load, move, scan, and store operations 
using strings. 

Other miscellaneous data types include bit fields and bit strings. A bit field is a contiguous sequence of 
bits and is used as a mask value by some instructions. A bit field can start at any bit position within a byte 
and contain up to 32 bits. A bit string is a contiguous sequence of bits containing up to 2^32 - 1 bits. The x86
instruction set includes instructions that can clear, set, scan, and test individual bits within a bit string. 
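
The clear, set, scan, and test operations on a bit string can be modeled in C++ using an array of bytes. The following sketch is an illustration under stated assumptions: the function names are hypothetical, the bit string is stored little endian (bit 0 in the low-order bit of the first byte), and the code models the kinds of operations mentioned above without reproducing the exact semantics of any particular x86 instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bit 'pos' of the string is stored in byte pos / 8, at bit position
// pos % 8 within that byte.

inline bool TestBit(const std::uint8_t* bits, std::size_t pos)
{
    assert(bits != nullptr);
    return ((bits[pos / 8] >> (pos % 8)) & 1u) != 0;
}

inline void SetBit(std::uint8_t* bits, std::size_t pos)
{
    assert(bits != nullptr);
    bits[pos / 8] = static_cast<std::uint8_t>(bits[pos / 8] | (1u << (pos % 8)));
}

inline void ClearBit(std::uint8_t* bits, std::size_t pos)
{
    assert(bits != nullptr);
    bits[pos / 8] = static_cast<std::uint8_t>(bits[pos / 8] & ~(1u << (pos % 8)));
}

// Returns the position of the first set bit, or num_bits if every bit is clear.
inline std::size_t ScanForward(const std::uint8_t* bits, std::size_t num_bits)
{
    for (std::size_t pos = 0; pos < num_bits; ++pos)
    {
        if (TestBit(bits, pos))
            return pos;
    }
    return num_bits;
}
```

For a 32-bit string that is initially zero, calling SetBit with position 13 changes the second byte to 0x20, and ScanForward then returns 13.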


Internal Architecture 


From the perspective of an executing program, the internal architecture of an x86-64 processor can be 
logically partitioned into several distinct units. These include the general-purpose registers, status and 
control flags (RFLAGS register), instruction pointer (RIP register), XMM registers, and floating-point control 
and status (MXCSR). By definition, an executing program uses the general-purpose registers, the RFLAGS 
register, and the RIP register. Program utilization of the XMM, YMM, ZMM, or MXCSR registers is optional. 
Figure 1-3 illustrates the internal architecture of an x86-64 processor. 
Figure 1-3. X86-64 processor internal architecture


All x86-64 compatible processors support SSE2 and include 16 128-bit XMM registers that programmers 
can use to perform scalar floating-point computations. These registers can also be employed to carry out 
SIMD operations using packed integers or packed floating-point values (both single precision and double 
precision). You'll learn how to use the XMM registers, the MXCSR register, and the AVX instruction set to 
perform floating-point calculations in Chapters 4 and 5. Chapter 4 also discusses the YMM register set and
other AVX architectural concepts in greater detail. You'll learn about AVX2 and AVX-512 in Chapters 8 and 12, 
respectively. 


General-Purpose Registers 


The x86-64 execution unit contains 16 64-bit general-purpose registers, which are used to perform 
arithmetic, logical, compare, data transfer, and address calculation operations. They also can be used as 
temporary storage locations for constant values, intermediate results, and pointers to data values stored in 
memory. Figure 1-4 shows the complete set of x86-64 general-purpose registers along with their instruction 
operand names. 
Figure 1-4. X86-64 general-purpose registers


The low-order doubleword, word, and byte of each 64-bit register are independently accessible and can 
be used to manipulate 32-bit, 16-bit, and 8-bit wide operands. For example, a function can use registers EAX, 
EBX, ECX, and EDX to perform 32-bit calculations in the low-order doublewords of registers RAX, RBX, RCX, 
and RDX, respectively. Similarly, registers AL, BL, CL, and DL can be used to carry out 8-bit calculations in 
the low-order bytes. It should be noted that a discrepancy exists regarding the names of some byte registers. 
The Microsoft 64-bit assembler uses the names shown in Figure 1-4, while the Intel documentation uses the 
names R8L - R15L. This book uses the Microsoft register names in order to maintain consistency between the 
text and the sample code. Not shown in Figure 1-4 are the legacy byte registers AH, BH, CH, and DH. These 
registers are aliased to the high-order bytes of registers AX, BX, CX, and DX, respectively. The legacy byte 
registers can be used in x86-64 programs, albeit with some restrictions, as described later in this chapter. 
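
The relationship between a 64-bit register and its smaller aliases can be modeled in C++ using casts and shifts. The following sketch is an illustration only: the function names are hypothetical, and the code manipulates an ordinary 64-bit integer rather than an actual processor register.

```cpp
#include <cstdint>

// Low-order doubleword of a 64-bit register value (e.g., EAX within RAX).
constexpr std::uint32_t LowDword(std::uint64_t reg)
{
    return static_cast<std::uint32_t>(reg);
}

// Low-order word (e.g., AX within RAX).
constexpr std::uint16_t LowWord(std::uint64_t reg)
{
    return static_cast<std::uint16_t>(reg);
}

// Low-order byte (e.g., AL within RAX).
constexpr std::uint8_t LowByte(std::uint64_t reg)
{
    return static_cast<std::uint8_t>(reg);
}

// High-order byte of the low-order word (e.g., legacy register AH, bits 8-15).
constexpr std::uint8_t HighByteOfWord(std::uint64_t reg)
{
    return static_cast<std::uint8_t>(reg >> 8);
}
```

If RAX contains 0x1122334455667788, these functions yield 0x55667788 (EAX), 0x7788 (AX), 0x88 (AL), and 0x77 (AH).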

Despite their designation as general-purpose registers, the x86-64 instruction set imposes some notable 
restrictions on how they can be used. Some instructions either require or implicitly use specific registers 
as operands. This is a legacy design pattern that dates back to the 8086 ostensibly to improve code density. 
For example, some variations of the imul (Signed Integer Multiplication) instruction save the calculated 
integer product to RDX:RAX, EDX:EAX, DX:AX, or AX (the colon notation signifies that the final product 
is contained in two registers, with the first register holding the high-order bits). The idiv (Signed Integer 
Division) instruction requires the integer dividend to be loaded in RDX:RAX, EDX:EAX, DX:AX, or AX.
The x86 string instructions require that the addresses of the source and destination operands be placed in
registers RSI and RDI, respectively. String instructions that include a repeat prefix must use RCX as the count
register, while variable-bit shift and rotate instructions must load the count value into register CL. 

The processor uses register RSP to support stack-related operations such as function calls and 
returns. The stack itself is simply a contiguous block of memory that is assigned to a process or thread by 
the operating system. Application programs can also use the stack to pass function arguments and store 
temporary data. The RSP register always points to the stack's topmost item. Stack push and pop operations 
are performed using 64-bit wide operands. This means that the location of the stack in memory is usually 
aligned to an 8-byte boundary. Some runtime environments (e.g., 64-bit Visual C++ programs running on 
Windows) align stack memory and RSP to a 16-byte boundary in order to avoid improperly-aligned memory 
transfers between the XMM registers and 128-bit wide operands stored on the stack. 

While it is technically possible to use the RSP register as a general-purpose register, such use is 
impractical and strongly discouraged. Register RBP is typically used as a base pointer to access data items 
that are stored on the stack. RSP can also be used as a base pointer to access data items on the stack. When 
not employed as a base pointer, programs can use RBP as a general-purpose register. 


RFLAGS Register 


The RFLAGS register contains a series of status bits (or flags) that the processor uses to signify the results of 
an arithmetic, logical, or compare operation. It also contains a number of control bits that are primarily used 
by operating systems. Table 1-4 shows the organization of the bits in the RFLAGS register. 


Table 1-4. RFLAGS Register 


Bit Position   Name                         Symbol   Use 
0              Carry Flag                   CF       Status 
1              Reserved                              1 
2              Parity Flag                  PF       Status 
3              Reserved                              0 
4              Auxiliary Carry Flag         AF       Status 
5              Reserved                              0 
6              Zero Flag                    ZF       Status 
7              Sign Flag                    SF       Status 
8              Trap Flag                    TF       System 
9              Interrupt Enable Flag        IF       System 
10             Direction Flag               DF       Control 
11             Overflow Flag                OF       Status 
12             I/O Privilege Level Bit 0    IOPL     System 
13             I/O Privilege Level Bit 1    IOPL     System 
14             Nested Task                  NT       System 
15             Reserved                              0 
16             Resume Flag                  RF       System 
17             Virtual 8086 Mode            VM       System 
18             Alignment Check              AC       System 
19             Virtual Interrupt Flag       VIF      System 
20             Virtual Interrupt Pending    VIP      System 
21             ID Flag                      ID       System 
22 - 63        Reserved                              0 


For application programs, the most important bits in the RFLAGS register are the following status 
flags: carry flag (CF), overflow flag (OF), parity flag (PF), sign flag (SF), and zero flag (ZF). The carry flag 
is set by the processor to signify an overflow condition when performing unsigned integer arithmetic. It is 
also used by some register rotate and shift instructions. The overflow flag signals that the result of a signed 
integer operation is too small or too large to be represented in the destination operand. The processor sets 
the parity flag to indicate whether the least-significant byte of an arithmetic, compare, or logical operation 
contains an even number of 1 bits (parity bits are used by some communication protocols to detect 
transmission errors). The sign and zero flags are set by arithmetic and logical instructions to signify a 
negative, zero, or positive result. 
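
The flag rules described above can be sketched in C++ for an 8-bit addition such as add al,bl. This is only an illustrative model, not processor documentation: the StatusFlags type and the AddFlags8 function are hypothetical names introduced here, and the auxiliary carry flag is deliberately left out.

```cpp
#include <cstdint>

// Hypothetical helper type: the five status flags discussed in the text.
struct StatusFlags {
    bool cf, of, pf, sf, zf;
};

// Models how an 8-bit add (e.g., add al,bl) would set the status flags.
inline StatusFlags AddFlags8(std::uint8_t a, std::uint8_t b)
{
    const unsigned wide = static_cast<unsigned>(a) + static_cast<unsigned>(b);
    const std::uint8_t r = static_cast<std::uint8_t>(wide);

    StatusFlags f{};
    f.cf = wide > 0xFFu;                       // carry out of bit 7 (unsigned overflow)
    f.of = ((a ^ r) & (b ^ r) & 0x80u) != 0;   // both inputs differ in sign from the result
    f.sf = (r & 0x80u) != 0;                   // copy of the result's most-significant bit
    f.zf = (r == 0);

    int ones = 0;
    for (int i = 0; i < 8; ++i)
        ones += (r >> i) & 1;
    f.pf = (ones % 2) == 0;                    // set when the result byte has an even number of 1 bits
    return f;
}
```

With this model, AddFlags8(0xFF, 0x01) reports CF and ZF set (the unsigned sum wraps to zero), while AddFlags8(0x7F, 0x01) reports OF and SF set because the signed result 128 does not fit in eight bits.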

The RFLAGS register contains a control bit called the direction flag (DF). An application program can 
set or reset the direction flag, which defines the auto increment direction (0 = low to high addresses, 1 = 
high to low addresses) of the RDI and RSI registers during execution of string instructions. The remaining 
bits in the RFLAGS register are used exclusively by the operating system to manage interrupts, restrict I/O 
operations, support program debugging, and handle virtual operations. They should never be modified by 
an application program. Reserved bits also should never be modified, and no assumptions should ever be 
made regarding the state of any reserved bit. 
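
The effect of the direction flag can be pictured with a small C++ model of a repeated byte-move string instruction. This is a sketch under stated assumptions: RepMovsb is a hypothetical function, memory is modeled as a std::vector, and the RSI, RDI, and RCX registers are modeled as plain indices and a count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a repeated byte move: copies rcx bytes from mem[rsi] to mem[rdi].
// df mirrors RFLAGS.DF: false = indices ascend, true = indices descend.
inline void RepMovsb(std::vector<std::uint8_t>& mem, std::size_t rsi, std::size_t rdi,
                     std::size_t rcx, bool df)
{
    assert(rcx <= mem.size());
    while (rcx != 0) {
        mem[rdi] = mem[rsi];
        if (df) { --rsi; --rdi; }   // DF = 1: high to low addresses
        else    { ++rsi; ++rdi; }   // DF = 0: low to high addresses
        --rcx;
    }
}
```

The direction matters when the source and destination overlap. Starting from the bytes {1, 2, 3, 4, 0, 0}, copying four bytes from index 0 to index 2 in the ascending direction overwrites source bytes before they are read, while starting at the last byte (indices 3 and 5) in the descending direction preserves them.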


Instruction Pointer 


The instruction pointer register (RIP) contains the logical address of the next instruction to be executed. The 
value in register RIP updates automatically during execution of each instruction. It is also implicitly altered 
during execution of control-transfer instructions. For example, the call (Call Procedure) instruction pushes 
the contents of the RIP register onto the stack and transfers program control to the address designated by the 
specified operand. The ret (Return from Procedure) instruction transfers program control by popping the 
top-most eight bytes off the stack and loading them into the RIP register. 

The jmp (Jump) and jcc (Jump if Condition is Met) instructions also transfer program control by 
modifying the contents of the RIP register. Unlike the call and ret instructions, all x86-64 jump instructions 
are executed independent of the stack. The RIP register is also used for displacement-based operand 
memory addressing as explained in the next section. It is not possible for an executing task to directly access 
the contents of the RIP register. 
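
The interaction between RIP and the stack during call, ret, and jmp can be sketched with a toy C++ model. The ToyCpu type and its member names are hypothetical and exist only for illustration; the stack is modeled as a vector of 64-bit values rather than as memory addressed by RSP, and rip is treated as the address of the instruction being executed.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical toy machine: only RIP and a stack of 64-bit items are modeled.
struct ToyCpu {
    std::uint64_t rip = 0;                // address of the instruction being executed
    std::vector<std::uint64_t> stack;     // back() plays the role of the item RSP points to

    // call: push the address of the instruction that follows, then transfer control.
    void Call(std::uint64_t target, std::uint64_t instrLength)
    {
        stack.push_back(rip + instrLength);   // return address
        rip = target;
    }

    // ret: pop the top-most eight bytes off the stack into RIP.
    void Ret()
    {
        rip = stack.back();
        stack.pop_back();
    }

    // jmp: modify RIP without touching the stack.
    void Jmp(std::uint64_t target) { rip = target; }
};
```

For instance, a 5-byte call located at 0x1000 that targets 0x2000 leaves 0x1005 on the stack; a later Ret restores RIP to 0x1005, and any Jmp executed in between leaves the stack unchanged.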


Instruction Operands 


All x86-64 instructions use operands, which designate the specific values that an instruction will act upon. 
Nearly all instructions require one or more source operands along with a single destination operand. Most 
instructions also require the programmer to explicitly specify the source and destination operands. There 
are, however, a number of instructions where the register operands are either implicitly specified or required 
by an instruction, as discussed in the previous section. 

There are three basic types of operands: immediate, register, and memory. An immediate operand 
is a constant value that is encoded as part of the instruction. These are typically used to specify constant 
values. Only source operands can specify an immediate value. Register operands are contained in a general- 
purpose or SIMD register. A memory operand specifies a location in memory, which can contain any of 
the data types described earlier in this chapter. An instruction can specify either the source or destination 
operand as a memory operand but not both. Table 1-5 contains several examples of instructions that employ 
the various operand types. 


Table 1-5. Examples of Basic Operand Types 


Type        Example                  Analogous C/C++ Statement 
Immediate   mov rax,42               rax = 42 
            imul r12,-47             r12 *= -47 
            shl r15,8                r15 <<= 8 
            xor ecx,80000000h        ecx ^= 0x80000000 
            sub r9b,14               r9b -= 14 
Register    mov rax,rbx              rax = rbx 
            add rbx,r10              rbx += r10 
            mul rbx                  rdx:rax = rax * rbx 
            and r8w,0ff00h           r8w &= 0xff00 
Memory      mov rax,[r13]            rax = *r13 
            or rcx,[rbx+rsi*8]       rcx |= *(rbx+rsi*8) 
            sub qword ptr [r8],17    *(long long*)r8 -= 17 
            shl word ptr [r12],2     *(short*)r12 <<= 2 


The mul rbx (Unsigned Multiply) instruction that is shown in Table 1-5 is an example of implicit 
operand usage. In this example, implicit register RAX and explicit register RBX are used as the source 
operands, and implicit register pair RDX:RAX is the destination operand. The multiplicative product's high- 
order and low-order quadwords are stored in RDX and RAX, respectively. 
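
What the mul rbx instruction computes can be approximated in portable C++ by building the 128-bit product from 32-bit pieces. This is a sketch, not how a compiler or the processor implements the operation; Product128 and MulU64 are hypothetical names, with the hi and lo members standing in for registers RDX and RAX.

```cpp
#include <cstdint>

// Hypothetical stand-in for the RDX:RAX register pair.
struct Product128 {
    std::uint64_t hi;   // plays the role of RDX (high-order quadword)
    std::uint64_t lo;   // plays the role of RAX (low-order quadword)
};

// Unsigned 64 x 64 -> 128-bit multiply using four 32 x 32 -> 64-bit partial products.
inline Product128 MulU64(std::uint64_t rax, std::uint64_t rbx)
{
    const std::uint64_t mask = 0xFFFFFFFFull;
    const std::uint64_t a0 = rax & mask, a1 = rax >> 32;
    const std::uint64_t b0 = rbx & mask, b1 = rbx >> 32;

    const std::uint64_t p00 = a0 * b0;
    const std::uint64_t p01 = a0 * b1;
    const std::uint64_t p10 = a1 * b0;
    const std::uint64_t p11 = a1 * b1;

    // Sum of the terms that land in bits 32..63; this sum cannot overflow 64 bits.
    const std::uint64_t mid = (p00 >> 32) + (p01 & mask) + (p10 & mask);

    Product128 r;
    r.lo = (mid << 32) | (p00 & mask);
    r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    return r;
}
```

Multiplying 0xFFFFFFFFFFFFFFFF by 2, for example, yields hi = 1 and lo = 0xFFFFFFFFFFFFFFFE, which shows why a single 64-bit destination register would not be sufficient.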

In Table 1-5's penultimate example, the text qword ptr is an assembler operator that acts like a C/C++ 
cast operator. In this instance, the value 17 is subtracted from a 64-bit value whose memory location is 
specified by the contents of register R8. Without the qword ptr operator, the assembly language statement 
is ambiguous since the assembler can't ascertain the size of the operand pointed to by R8. In this example, 
the destination could also be an 8-bit, 16-bit, or 32-bit sized operand. The final example in Table 1-5 uses the 
word ptr operator in a similar manner. You'll learn more about assembler operators and directives in the 
programming chapters of this book. 


Memory Addressing 


An x86-64 instruction requires up to four separate components in order to specify the location of an operand 
in memory. The four components include a constant displacement value, a base register, an index register, 
and a scale factor. Using these components, the processor calculates an effective address for a memory 
operand as follows: 


EffectiveAddress = BaseReg + IndexReg * ScaleFactor + Disp 

The base register (BaseReg) can be any general-purpose register. The index register (IndexReg) can be 
any general-purpose register except RSP. Valid scale factors (ScaleFactor) include 1, 2, 4, and 8 (a scale 
factor of 1 leaves the index register unscaled). Finally, the 
displacement (Disp) is a constant 8-bit, 16-bit, or 32-bit signed offset that's encoded within the instruction. 
Table 1-6 illustrates x86-64 memory addressing using different forms of the mov (Move) instruction. In these 
examples, register RAX (the destination operand) is loaded with the quadword value that's specified by the 
source operand. Note that it is not necessary for an instruction to explicitly specify all of the components 
required for an effective address. For example, a default value of zero is used for the displacement if an 
explicit value is not specified. The final size of an effective address calculation is always 64 bits. 


Table 1-6. Memory Operand Addressing 


Addressing Form                   Example 
RIP + Disp                        mov rax,[Val] 
BaseReg                           mov rax,[rbx] 
BaseReg + Disp                    mov rax,[rbx+16] 
IndexReg * SF + Disp              mov rax,[r15*8+48] 
BaseReg + IndexReg                mov rax,[rbx+r15] 
BaseReg + IndexReg + Disp         mov rax,[rbx+r15+32] 
BaseReg + IndexReg * SF           mov rax,[rbx+r15*8] 
BaseReg + IndexReg * SF + Disp    mov rax,[rbx+r15*8+64] 


The memory addressing forms shown in Table 1-6 are used to directly reference program variables and 
data structures. For example, the simple displacement form is often used to access a simple global or static 
variable. The base register form is analogous to a C/C++ pointer and is used to indirectly reference a single 
value. Individual fields within a data structure can be retrieved using a base register and a displacement. The 
index register forms are useful for accessing individual elements within an array. Scale factors can reduce 
the amount of code needed to access the elements of an array that contains integer or floating-point values. 
Elements in more elaborate data structures can be referenced by using a base register together with an index 
register, scale factor, and displacement. 
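
The effective address formula can be written out as a short C++ function. This is a minimal sketch: EffectiveAddress is a hypothetical name, register contents are passed in as plain 64-bit integers, and the displacement is assumed to be a signed 32-bit value that is sign-extended before the addition.

```cpp
#include <cassert>
#include <cstdint>

// EffectiveAddress = BaseReg + IndexReg * ScaleFactor + Disp (wraps modulo 2^64).
inline std::uint64_t EffectiveAddress(std::uint64_t baseReg, std::uint64_t indexReg,
                                      unsigned scaleFactor, std::int32_t disp)
{
    // A scale factor of 1 corresponds to using the index register unscaled.
    assert(scaleFactor == 1 || scaleFactor == 2 || scaleFactor == 4 || scaleFactor == 8);

    // Sign-extend the displacement to 64 bits before adding it.
    const std::uint64_t disp64 = static_cast<std::uint64_t>(static_cast<std::int64_t>(disp));
    return baseReg + indexReg * scaleFactor + disp64;
}
```

With a base of 0x1000, an index of 3, a scale factor of 8, and a displacement of 64 (the form used by mov rax,[rbx+r15*8+64]), the function returns 0x1058. This matches the address of element 3 in an array of quadwords that begins 64 bytes past the base address.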

The mov rax, [Val] instruction that's shown in the first row of Table 1-6 is an example of RIP-relative (or 
instruction pointer relative) addressing. With RIP-relative addressing, the processor calculates an effective 
address using the contents of the RIP register and a signed 32-bit displacement value that's encoded within 
the instruction. Figure 1-5 illustrates this calculation in greater detail. Note the little endian ordering 
of the displacement value that's embedded in the mov rax, [Val] instruction. RIP-relative addressing 
allows the processor to reference global or static operands using a 32-bit displacement instead of a 64-bit 
displacement, which reduces required code space. It also facilitates position-independent code. 

Memory Address      Machine Code               Instruction 
00007FF778A06490    48 8B 05 F9 67 00 00       mov rax,[Val] 
00007FF778A06497    48 03 C1                   add rax,rcx      <-- RIP 
00007FF778A0649A    48 2B C2                   sub rax,rdx 

00007FF778A0CC90    88 13 00 00 00 00 00 00    Val = 5000 


Figure 1-5. RIP-relative effective address calculation 


One minor constraint of RIP-relative addressing is that the target operand must reside within a ±2 GB 
address window of the value that's contained in register RIP. For most programs, this limitation is rarely a 
concern. The calculation of a RIP-relative displacement value is automatically determined by the assembler 
during code generation. This means that you can use a mov rax,[Val] or similar instruction without 
having to worry about the details of the displacement value calculation. 
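
Although the assembler handles the displacement automatically, the calculation in Figure 1-5 is easy to reproduce. The following C++ sketch decodes the little-endian displacement bytes and adds them to the address of the next instruction; DecodeDisp32 and RipRelativeTarget are hypothetical helper names used only for this illustration.

```cpp
#include <cstdint>

// Assembles a signed 32-bit displacement from four little-endian instruction bytes.
inline std::int32_t DecodeDisp32(const std::uint8_t bytes[4])
{
    const std::uint32_t u = static_cast<std::uint32_t>(bytes[0])
                          | (static_cast<std::uint32_t>(bytes[1]) << 8)
                          | (static_cast<std::uint32_t>(bytes[2]) << 16)
                          | (static_cast<std::uint32_t>(bytes[3]) << 24);
    return static_cast<std::int32_t>(u);   // assumes two's complement conversion
}

// nextRip is the address of the instruction that follows the one being executed.
inline std::uint64_t RipRelativeTarget(std::uint64_t nextRip, std::int32_t disp)
{
    return nextRip + static_cast<std::uint64_t>(static_cast<std::int64_t>(disp));
}
```

Using the values from Figure 1-5, the displacement bytes F9 67 00 00 decode to 0x000067F9, and adding this value to 0x00007FF778A06497 (the address of the add rax,rcx instruction) gives 0x00007FF778A0CC90, which is the address of Val.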


Differences Between x86-64 and x86-32 Programming 


There are some important differences between x86-64 and x86-32 assembly language programming. If you 
are learning x86 assembly language programming for the first time, you can either skim or skip this section 
since it discusses concepts that aren't fully explained until later in this book. 

Most existing x86-32 instructions have an x86-64 equivalent instruction that enables a function to exploit 
64-bit wide addresses and operands. X86-64 functions can also perform calculations using instructions that 
manipulate 8-bit, 16-bit, or 32-bit registers and operands. Except for the mov instruction, the maximum size 
of an x86-64 mode immediate value is 32 bits. If an instruction manipulates a 64-bit wide register or memory 
operand, any specified 32-bit immediate value is sign-extended to 64 bits prior to its use. 

Table 1-7 contains some examples of x86-64 instructions using various operand sizes. Note that the 
memory operands in these example instructions are referenced using 64-bit registers, which is required 
in order to access the entire 64-bit linear address space. While it is possible in x86-64 mode to reference a 
memory operand using a 32-bit register (e.g., mov r10, [eax]), the location of the operand must reside in 
the low 4 GB portion of the 64-bit effective address space. Using 32-bit registers to access memory operands 
in x86-64 mode is not recommended since it introduces unnecessary and potentially dangerous code 
obfuscations. It also complicates software testing and debugging. 


Table 1-7. Examples of x86-64 Instructions Using Various Operand Sizes 


8-Bit             16-Bit            32-Bit             64-Bit 
add al,bl         add ax,bx         add eax,ebx        add rax,rbx 
cmp dl,[r15]      cmp dx,[r15]      cmp edx,[r15]      cmp rdx,[r15] 
mul r10b          mul r10w          mul r10d           mul r10 
or [r8+rdi],al    or [r8+rdi],ax    or [r8+rdi],eax    or [r8+rdi],rax 
shl r9b,cl        shl r9w,cl        shl r9d,cl         shl r9,cl 

The aforementioned immediate value size limitation warrants some extra discussion since it 
sometimes affects the instruction sequences that a program must use to carry out certain operations. 
Figure 1-6 contains a few examples of instructions that use a 64-bit register with an immediate operand. In 
the first example, the mov rax,100 instruction loads an immediate value into the RAX register. Note that 
the machine code uses only 32 bits (the bytes 64 00 00 00) to encode the immediate value 100. This value is 
sign-extended to 64 bits and saved in RAX. The add rax,200 instruction that follows also sign-extends 
its immediate value prior to performing the addition. The next example opens with a mov rcx,-2000 
instruction that loads a negative immediate value into RCX. The machine code for this instruction also uses 
32 bits to encode the immediate value -2000, which is sign-extended to 64 bits and saved in RCX. The 
subsequent add rcx,1000 instruction yields a 64-bit result of -1000. 


Machine Code                     Instruction          DesOp Result 
48 C7 C0 64 00 00 00             mov rax,100          0000000000000064h 
48 05 C8 00 00 00                add rax,200          000000000000012Ch 

48 C7 C1 30 F8 FF FF             mov rcx,-2000        FFFFFFFFFFFFF830h 
48 81 C1 E8 03 00 00             add rcx,1000         FFFFFFFFFFFFFC18h 

48 C7 C2 FF 00 00 00             mov rdx,0ffh         00000000000000FFh 
48 81 CA 00 00 00 80             or rdx,80000000h     FFFFFFFF800000FFh 

48 C7 C2 FF 00 00 00             mov rdx,0ffh         00000000000000FFh 
49 B8 00 00 00 80 00 00 00 00    mov r8,80000000h     0000000080000000h 
49 0B D0                         or rdx,r8            00000000800000FFh 


Figure 1-6. Using 64-bit registers with immediate operands 


The third example employs a mov rdx,0ffh instruction to initialize register RDX. This is followed by an 
or rdx,80000000h instruction that sign-extends the immediate value 0x80000000 to 0xFFFFFFFF80000000, 
and then performs a bitwise inclusive OR operation. The value that's shown for RDX is almost certainly 
not the intended result. The final example illustrates how to carry out an operation that requires a 64-bit 
immediate value. A mov r8,80000000h instruction loads the 64-bit value 0x0000000080000000 into R8. As 
mentioned earlier in this section, the mov instruction is the only instruction that supports 64-bit immediate 
operands. Execution of the ensuing or rdx,r8 instruction yields the expected value. 
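
The difference between the third and fourth examples comes down to sign extension, which can be reproduced in a few lines of C++. This is an illustrative sketch only; SignExtendImm32, OrImm32, and OrImm64 are hypothetical names, and the 64-bit registers are modeled as unsigned 64-bit integers.

```cpp
#include <cstdint>

// Widens a 32-bit immediate to 64 bits by replicating its sign bit (bit 31).
inline std::uint64_t SignExtendImm32(std::uint32_t imm32)
{
    return static_cast<std::uint64_t>(
        static_cast<std::int64_t>(static_cast<std::int32_t>(imm32)));
}

// Models or rdx,imm32: the immediate is sign-extended before the OR.
inline std::uint64_t OrImm32(std::uint64_t rdx, std::uint32_t imm32)
{
    return rdx | SignExtendImm32(imm32);
}

// Models mov r8,imm64 followed by or rdx,r8: the full 64-bit constant is used as written.
inline std::uint64_t OrImm64(std::uint64_t rdx, std::uint64_t imm64)
{
    return rdx | imm64;
}
```

Here OrImm32(0xFF, 0x80000000) produces 0xFFFFFFFF800000FF, whereas OrImm64(0xFF, 0x80000000) produces 0x00000000800000FF. Immediate values whose bit 31 is clear, such as 100, are unaffected by the extension.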

The 32-bit size limitation for immediate values also applies to jmp and call instructions that specify 
relative-displacement targets. In these cases, the target (or location) of a jmp or call instruction must reside 
within a ±2 GB address window of the current RIP register. Targets whose relative displacements exceed 
this window can only be accessed using a jmp or call instruction that employs an indirect operand (e.g., 
jmp qword ptr [FuncPtr] or call rax). Like RIP-relative addressing, the size limitations described in this 
paragraph are unlikely to present significant obstacles for most assembly language functions. 

Another difference between x86-32 and x86-64 assembly language programming is the effect that 
some instructions have on the upper 32 bits of a 64-bit general-purpose register. When using instructions 
that manipulate 32-bit registers and operands, the high-order 32 bits of the corresponding 64-bit general- 
purpose register are zeroed during execution. For example, assume that register RAX contains the value 
0x8000000000000000. Execution of the instruction add eax,10 generates a result of 0x000000000000000A 
in RAX. However, when working with 8-bit or 16-bit registers and operands, the upper 56 or 48 bits of the 
corresponding 64-bit general-purpose register are not modified. Assuming again that RAX contains 
0x8000000000000000, execution of the instructions add al,20 or add ax,40 would yield RAX values of 
0x8000000000000014 or 0x8000000000000028, respectively. 
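
These rules for partial register updates can be summarized with three small C++ helpers. The names Write32, Write16, and Write8 are hypothetical; each takes the current 64-bit register value and the new low-order result, and returns the 64-bit value the register would hold afterward.

```cpp
#include <cstdint>

// Writing a 32-bit result: the high-order 32 bits of the 64-bit register are zeroed,
// so the previous register contents do not matter.
inline std::uint64_t Write32(std::uint64_t /*reg64*/, std::uint32_t value)
{
    return value;
}

// Writing a 16-bit result: the upper 48 bits are preserved.
inline std::uint64_t Write16(std::uint64_t reg64, std::uint16_t value)
{
    return (reg64 & ~0xFFFFull) | value;
}

// Writing an 8-bit result (low byte): the upper 56 bits are preserved.
inline std::uint64_t Write8(std::uint64_t reg64, std::uint8_t value)
{
    return (reg64 & ~0xFFull) | value;
}
```

Starting with 0x8000000000000000, a 32-bit write of the value 10 leaves 0x000000000000000A, while an 8-bit write of 20 and a 16-bit write of 40 leave 0x8000000000000014 and 0x8000000000000028.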

The x86-64 platform imposes some restrictions on the use of legacy registers AH, BH, CH, and DH. 
These registers cannot be used with instructions that also reference one of the new 8-bit registers (i.e., 
SIL, DIL, BPL, SPL, and R8B - R15B). Existing x86-32 instructions such as mov ah,bl and add dh,bl are still 
allowed in x86-64 programs. However, the instructions mov ah,r8b and add dh,r8b are invalid. 


Invalid Instructions 


A handful of rarely used x86-32 instructions cannot be used in x86-64 programs. Table 1-8 lists these 
instructions. Somewhat surprisingly, early-generation x86-64 processors did not support the lahf and 
sahf instructions in x86-64 mode (they still worked in x86-32 mode). Fortunately, these instructions were 
reinstated, and should be available in most AMD and Intel processors marketed since 2006. A program can 
confirm processor support for the lahf and sahf instructions in x86-64 mode by testing the cpuid feature 
flag LAHF-SAHF. 


Table 1-8. X86-64 Mode Invalid Instructions 


Mnemonic      Name 
aaa           ASCII Adjust After Addition 
aad           ASCII Adjust After Division 
aam           ASCII Adjust After Multiplication 
aas           ASCII Adjust After Subtraction 
bound         Check Array Index Against Bounds 
daa           Decimal Adjust After Addition 
das           Decimal Adjust After Subtraction 
into          Generate interrupt if RFLAGS.OF Equals 1 
pop[a|ad]     Pop all General-Purpose Registers 
push[a|ad]    Push all General-Purpose Registers 


Deprecated Instructions 


Processors that support the x86-64 instruction set also include the computational resources of SSE2. This 
means that x86-64 programs can safely use the packed integer instructions of SSE2 instead of MMX. It also 
means that x86-64 programs can use SSE2's (or AVX's if available) scalar floating-point instructions instead 
of x87 FPU instructions. X86-64 programs can still take advantage of the MMX and x87 FPU instruction sets, 
and such use might be warranted when migrating x86-32 legacy code to the x86-64 platform. For new x86-64 
software development, however, using the MMX and x87 FPU instruction sets is not recommended. 


Instruction Set Overview 


Table 1-9 lists in alphabetical order the core x86-64 instructions that are frequently used in assembly 
language functions. For each instruction mnemonic, there is a deliberately succinct description since 
comprehensive details of each instruction including execution particulars, valid operands, affected flags, 
and exceptions are readily available in reference manuals published by AMD and Intel. Appendix A contains 
a list of these manuals. The programming examples in Chapters 2 and 3 also contain additional information 
regarding proper use of these instructions. 

Note that Table 1-9 uses brackets in the mnemonics column to represent distinct variations of a 
common instruction. For example, bs[f|r] denotes the distinct instructions bsf (Bit Scan Forward) and bsr 
(Bit Scan Reverse). 


Table 1-9. Overview of Core X86-64 Instructions 


Mnemonic               Instruction Name 
adc                    Integer addition with carry 
add                    Integer addition 
and                    Bitwise AND 
bs[f|r]                Bit scan forward; Bit scan reverse 
b[t|tr|ts]             Bit test; Bit test and reset; Bit test and set 
call                   Call procedure 
cld                    Clear direction flag (RFLAGS.DF) 
cmovcc                 Conditional move 
cmp                    Compare operands 
cmps[b|w|d|q]          Compare string operands 
cpuid                  Query CPU identification and feature information 
c[wd|dq|qo]            Convert operand 
dec                    Decrement operand by 1 
div                    Unsigned integer division 
idiv                   Signed integer division 
imul                   Signed integer multiplication 
inc                    Increment operand by 1 
jcc                    Conditional jump 
jmp                    Unconditional jump 
lahf                   Load status flags into register AH 
lea                    Load effective address 
lods[b|w|d|q]          Load string operand 
mov                    Move data 
mov[sx|sxd]            Move integer with sign extension 
movzx                  Move integer with zero extension 
mul                    Unsigned integer multiplication 
neg                    Two's complement negation 
not                    One's complement negation 
or                     Bitwise inclusive OR 
pop                    Pop top-of-stack value to operand 
popfq                  Pop top-of-stack value to RFLAGS 
push                   Push operand onto stack 
pushfq                 Push RFLAGS onto stack 
rc[l|r]                Rotate left with RFLAGS.CF; Rotate right with RFLAGS.CF 
ret                    Return from procedure 
re[p|pe|pz|pne|pnz]    Repeat string operation (instruction prefix) 
ro[l|r]                Rotate left; Rotate right 
sahf                   Store AH into status flags 
sar                    Shift arithmetic right 
setcc                  Set byte on condition 
sh[l|r]                Shift logical left; Shift logical right 
sbb                    Integer subtraction with borrow 
std                    Set direction flag (RFLAGS.DF) 
stos[b|w|d|q]          Store string value 
test                   Test operand (sets status flags) 
xchg                   Exchange source and destination operand values 
xor                    Bitwise exclusive OR 


Most arithmetic and logical instructions update one or more of the status flags in the RFLAGS register. 
As discussed earlier in this chapter, the status flags provide additional information about the results of an 
operation. The jcc, cmovcc, and setcc instructions use what are called condition codes to test the status 
flags either individually or in multiple-flag combinations. Table 1-10 lists the condition codes, mnemonic 
suffixes, and the corresponding RFLAGS tested by these instructions. 


Table 1-10. Condition Codes, Mnemonic Suffixes, and Test Conditions 


Condition Code               Mnemonic Suffix    RFLAGS Test Condition 
Above                        A                  CF == 0 && ZF == 0 
Neither below nor equal      NBE 
Above or equal               AE                 CF == 0 
Not below                    NB 
Below                        B                  CF == 1 
Neither above nor equal      NAE 
Below or equal               BE                 CF == 1 || ZF == 1 
Not above                    NA 
Equal                        E                  ZF == 1 
Zero                         Z 
Not equal                    NE                 ZF == 0 
Not zero                     NZ 
Greater                      G                  ZF == 0 && SF == OF 
Neither less nor equal       NLE 
Greater or equal             GE                 SF == OF 
Not less                     NL 
Less                         L                  SF != OF 
Neither greater nor equal    NGE 
Less or equal                LE                 ZF == 1 || SF != OF 
Not greater                  NG 
Sign                         S                  SF == 1 
Not sign                     NS                 SF == 0 
Carry                        C                  CF == 1 
Not carry                    NC                 CF == 0 
Overflow                     O                  OF == 1 
Not overflow                 NO                 OF == 0 
Parity                       P                  PF == 1 
Parity even                  PE 
Not parity                   NP                 PF == 0 
Parity odd                   PO 


The alternate forms of many Table 1-10 mnemonics are defined to provide algorithmic flexibility or 
improve program readability. When using one of the aforementioned conditional instructions in source 
code, condition-codes containing the words "above" and "below" are employed for unsigned-integer 
operands while the words "greater" and "less" are used for signed-integer operands. If the contents of 
Table 1-10 seem a little confusing or abstract, don't worry. You'll see a plethora of condition code 
examples in subsequent chapters of this book. 
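
In the meantime, the distinction between the unsigned ("above" and "below") and signed ("greater" and "less") condition codes can be sketched in C++. The CmpFlags type and the Cmp32, Above, Below, Greater, and Less functions are hypothetical names; Cmp32 models a 32-bit cmp a,b by computing a - b and keeping only the resulting flags.

```cpp
#include <cstdint>

// Hypothetical container for the flags tested by the condition codes shown here.
struct CmpFlags {
    bool cf, zf, sf, of;
};

// Models cmp a,b for 32-bit operands: computes a - b and records only the flags.
inline CmpFlags Cmp32(std::uint32_t a, std::uint32_t b)
{
    const std::uint32_t r = a - b;
    CmpFlags f{};
    f.cf = a < b;                                // a borrow was required
    f.zf = (r == 0);
    f.sf = (r >> 31) != 0;
    f.of = (((a ^ b) & (a ^ r)) >> 31) != 0;     // signed overflow of a - b
    return f;
}

// Unsigned comparisons (Table 1-10: Above, Below).
inline bool Above(const CmpFlags& f)   { return !f.cf && !f.zf; }
inline bool Below(const CmpFlags& f)   { return f.cf; }

// Signed comparisons (Table 1-10: Greater, Less).
inline bool Greater(const CmpFlags& f) { return !f.zf && f.sf == f.of; }
inline bool Less(const CmpFlags& f)    { return f.sf != f.of; }
```

Comparing 0xFFFFFFFF with 1 satisfies both Above and Less: interpreted as an unsigned value the first operand is 4294967295, but interpreted as a signed value it is -1.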


Summary 


Here are the key learning points of Chapter 1: 


• The fundamental data types of the x86-64 platform include bytes, words, 
doublewords, quadwords, and double quadwords. Intrinsic programming language 
data types such as characters, text strings, integers, and floating-point values are 
derived from the fundamental data types. 

• The x86-64 execution unit includes 16 64-bit general-purpose registers that are used 
to perform arithmetic, logical, and data transfer operations using 8-bit, 16-bit, 32-bit, 
and 64-bit operands. 

• The x86-64 execution unit includes 16 128-bit XMM registers that can be used to 
perform scalar floating-point arithmetic using single-precision or double-precision 
values. These registers can also be employed to carry out SIMD operations using 
packed integers or packed floating-point values. 

• Most x86-64 assembly language instructions can be used with the following explicit 
operand types: immediate, register, and memory. Some instructions employ implicit 
registers as their operands. 

• An operand in memory can be referenced using a variety of addressing modes that 
include one or more of the following components: fixed displacement, base register, 
index register, and/or scale factor. 

• Most arithmetic and logical instructions update one or more of the status flags in 
the RFLAGS register. These flags can be tested to alter program flow or conditionally 
assign values to variables. 

CHAPTER 2 


X86-64 Core Programming - Part 1 


In the previous chapter, you learned about the fundamentals of the x86-64 platform including its data types, 
register sets, memory addressing modes, and the core instruction set. In this chapter, you learn how to code 
basic x86-64 assembly language functions that are callable from C++. You also learn about the semantics and 
syntax of an x86-64 assembly language source code file. The sample source code and accompanying remarks 
of this chapter are intended to complement the instructive material presented in Chapter 1. 

The content of Chapter 2 is organized as follows. The first section describes how to code functions 
that perform simple integer arithmetic such as addition and subtraction. You also learn the basics of 
passing arguments and return values between functions written in C++ and x86-64 assembly language. 
The next section highlights additional arithmetic instructions including integer multiplication and 
division. In the final section, you learn how to reference operands in memory and use conditional jumps 
and conditional moves. 

It should be noted that the primary purpose of the sample code presented in this chapter is to elucidate 
proper use of the x86-64 instruction set and basic assembly language programming techniques. All of the 
assembly language code is straightforward, but not necessarily optimal since understanding optimized 
assembly language code can be challenging especially for beginners. The sample code that's discussed in 
later chapters places more emphasis on efficient coding techniques. Chapter 15 also examines techniques 
that you can use to improve the efficiency of your assembly language code. 


Simple Integer Arithmetic 


In this section, you learn the basics of x86-64 assembly language programming. It begins with a simple 
program that demonstrates how to perform integer addition and subtraction. This is followed by an example 
program that illustrates use of the logical instructions and, or, and xor. The final program describes how to 
execute shift operations. All three programs illustrate passing argument and return values between a C++ 
and assembly language function. They also show how to employ commonly-used assembler directives. 

As mentioned in the Introduction, all of the sample code discussed in this book was created using 
Microsoft's Visual C++ and Macro Assembler (MASM), which are included with Visual Studio. Before taking 
a look at the first code example, a few instructive comments about these development tools may be helpful. 
Visual Studio uses entities called solutions and projects to help simplify application development. A solution 
is a collection of one or more projects that are used to build an application. Projects are container objects 
that help organize an application's files including (but not limited to) source code, resources, icons, bitmaps, 
HTML, and XML. A Visual Studio project is usually created for each buildable component (e.g., executable 
file, dynamic-linked library, static library, etc.) of an application. You can open and load a chapter's sample 
programs into the Visual Studio development environment by double-clicking on its solution (.sln) file. 
Appendix A contains additional information regarding the use of Visual C++ and MASM. 

Note: All of the source code examples in this book include one or more functions written in x86-64 
assembly language plus some C++ code that demonstrates how to invoke the assembly language code. The 
C++ code also contains ancillary functions that perform required initializations and display results. For each 
source code example, a single listing that includes both the C++ and assembly language source code is used in 
order to minimize the number of listing references in the main text. The actual source code uses separate files 
for the C++ (.cpp) and assembly language (.asm) code. 


Addition and Subtraction 


The first source code example of this chapter is called Ch02_01. This example demonstrates how to use the 
x86-64 assembly language instructions add (Integer Add) and sub (Integer Subtract). It also illustrates some 
basic assembly language programming concepts including argument passing, returning values, and how to 
use a few MASM assembler directives. Listing 2-1 shows the source code for example Ch02_01. 


Listing 2-1. Example Ch02_01 


#include "stdafx.h" 
#include <iostream> 

using namespace std; 

extern "C" int IntegerAddSub_(int a, int b, int c, int d); 

static void PrintResult(const char* msg, int a, int b, int c, int d, int result) 
{ 
    const char nl = '\n'; 

    cout << msg << nl; 
    cout << "a=" << a << nl; 
    cout << "b=" << b << nl; 
    cout << "c=" << c << nl; 
    cout << "d=" << d << nl; 
    cout << "result = " << result << nl; 
    cout << nl; 
} 

int main() 
{ 
    int a, b, c, d, result; 

    a = 10; b = 20; c = 30; d = 18; 
    result = IntegerAddSub_(a, b, c, d); 
    PrintResult("Test 1", a, b, c, d, result); 

    a = 101; b = 34; c = -190; d = 25; 
    result = IntegerAddSub_(a, b, c, d); 
    PrintResult("Test 2", a, b, c, d, result); 

    return 0; 
} 

; extern "C" int IntegerAddSub_(int a, int b, int c, int d); 

        .code 
IntegerAddSub_ proc 

; Calculate a + b + c - d 
        mov eax,ecx                         ;eax = a 
        add eax,edx                         ;eax = a + b 
        add eax,r8d                         ;eax = a + b + c 
        sub eax,r9d                         ;eax = a + b + c - d 

        ret                                 ;return result to caller 
IntegerAddSub_ endp 
        end 


The C++ code in Listing 2-1 is mostly straightforward but includes a few lines that warrant some 
explanatory comments. The #include "stdafx.h" statement specifies a project-specific header file that 
contains references to frequently used system items. Visual Studio automatically generates this file whenever 
a new C++ project is created. The line extern "C" int IntegerAddSub_(int a, int b, int c, int d) 
is a declaration statement that defines the parameters and return value for the x86-64 assembly language 
function IntegerAddSub_ (all assembly language function names and public variables used in this book 
include a trailing underscore for easier recognition). The declaration statement's "C" modifier instructs 
the C++ compiler to use C-style naming for function IntegerAddSub_ instead of a C++ decorated name (a 
C++ decorated name includes extra characters that help support function overloading). It also notifies the 
compiler to use C-style linkage for the specified function. 

The C++ function main contains the code that calls the assembly language function IntegerAddSub_. 
This function requires four arguments of type int and returns a single int value. Like many programming 
languages, Visual C++ uses a combination of processor registers and the stack to pass argument values to a 
function. In the current example, the C++ compiler generates code that loads the values of a, b, c, and d into 
registers ECX, EDX, R8D, and R9D, respectively, prior to calling function IntegerAddSub_. 

In Listing 2-1 the x86-64 assembly language code for example Ch02_01 is shown immediately after the C++ 
function main. The first things to notice are the lines that begin with a semicolon. These are comment lines. 
MASM treats any text that follows a semicolon as comment text. The .code statement is a MASM directive 
that defines the start of an assembly language code section. A MASM directive is a statement that instructs the 
assembler how to perform certain actions. You'll learn how to use additional directives throughout this book. 

The IntegerAddSub_ proc statement defines the start of the assembly language function. Toward the 
end of Listing 2-1, the IntegerAddSub_ endp statement marks the end of the function. Like the .code line, 
the proc and endp statements are not executable instructions but assembler directives that signify the start 
and end of an assembly language function. The final end statement is a required assembler directive that 
indicates the completion of statements for the assembly language file. The assembler ignores any text that 
appears after the end directive. 

The assembly language function IntegerAddSub_ calculates a + b + c - d and returns this value 
to the calling C++ function. It begins with a mov eax,ecx (Move) instruction that copies the value a from 
ECX into EAX. Note that the contents of ECX are not altered by the mov instruction. Following execution 
of this mov instruction, registers EAX and ECX both contain the value a. The add eax,edx instruction adds 
the values in registers EAX and EDX. It then saves the sum (or a + b) in register EAX. Like the previous mov 
instruction, the contents of register EDX are not modified by the add instruction. The next instruction, add 
eax,r8d, computes a + b + c. This is followed by a sub eax,r9d instruction that calculates the final value 
a + b + c - d. 
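For readers who find it easier to follow the data flow in C++, the four instructions described above can be mirrored one statement at a time. This is only an illustrative sketch: the function name IntegerAddSubCpp and the local variable eax are hypothetical and are not part of the book's source code.

```cpp
// Hypothetical C++ mirror of IntegerAddSub_ (illustration only).
// Each statement corresponds to one instruction in the assembly listing.
int IntegerAddSubCpp(int a, int b, int c, int d)
{
    int eax = a;    // mov eax,ecx   ;eax = a
    eax += b;       // add eax,edx   ;eax = a + b
    eax += c;       // add eax,r8d   ;eax = a + b + c
    eax -= d;       // sub eax,r9d   ;eax = a + b + c - d
    return eax;     // ret           ;the result is returned in EAX
}
```

Calling IntegerAddSubCpp(10, 20, 30, 18) yields 42, which is the value that main prints for its first test case.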

An x86-64 assembly language function must use register EAX to return a single 32-bit integer (or C++ 
int) value to its calling function. In the current example, no additional instructions are necessary to achieve 
this requirement since EAX already contains the correct return value. The final ret (Return from Procedure) 
instruction transfers control back to the calling function main, which displays the result. Here's the output for 
example Ch02_01. 


Test 1
a = 10
b = 20
c = 30
d = 18
result = 42

Test 2
a = 101
b = 34
c = -190
d = 25
result = -80


Logical Operations 


The next source code example is called Ch02_02. This example illustrates use of the x86-64 instructions and 
(Logical AND), or (Logical Inclusive OR), and xor (Logical Exclusive OR). It also shows how to access a C++ 
global variable from an assembly language function. Listing 2-2 shows the source code for example Ch02_02. 


Listing 2-2. Example Ch02_02 


#include "stdafx.h"
#include <iostream>
#include <iomanip>

using namespace std;

extern "C" unsigned int IntegerLogical_(unsigned int a, unsigned int b, unsigned int c, unsigned int d);

extern "C" unsigned int g_Val1 = 0;

unsigned int IntegerLogicalCpp(unsigned int a, unsigned int b, unsigned int c, unsigned int d)
{
    // Calculate (((a & b) | c ) ^ d) + g_Val1
    unsigned int t1 = a & b;
    unsigned int t2 = t1 | c;
    unsigned int t3 = t2 ^ d;
    unsigned int result = t3 + g_Val1;

    return result;
}

void PrintResult(const char* s, unsigned int a, unsigned int b, unsigned int c, unsigned int d, unsigned val1, unsigned int r1, unsigned int r2)
{
    const int w = 8;
    const char nl = '\n';

    cout << s << nl;
    cout << setfill('0');
    cout << "a =    0x" << hex << setw(w) << a << " (" << dec << a << ")" << nl;
    cout << "b =    0x" << hex << setw(w) << b << " (" << dec << b << ")" << nl;
    cout << "c =    0x" << hex << setw(w) << c << " (" << dec << c << ")" << nl;
    cout << "d =    0x" << hex << setw(w) << d << " (" << dec << d << ")" << nl;
    cout << "val1 = 0x" << hex << setw(w) << val1 << " (" << dec << val1 << ")" << nl;
    cout << "r1 =   0x" << hex << setw(w) << r1 << " (" << dec << r1 << ")" << nl;
    cout << "r2 =   0x" << hex << setw(w) << r2 << " (" << dec << r2 << ")" << nl;
    cout << nl;

    if (r1 != r2)
        cout << "Compare failed" << nl;
}

int main()
{
    unsigned int a, b, c, d, r1, r2 = 0;

    a = 0x00223344;
    b = 0x00775544;
    c = 0x00555555;
    d = 0x00998877;
    g_Val1 = 7;
    r1 = IntegerLogicalCpp(a, b, c, d);
    r2 = IntegerLogical_(a, b, c, d);
    PrintResult("Test 1", a, b, c, d, g_Val1, r1, r2);

    a = 0x70987655;
    b = 0x55555555;
    c = 0xAAAAAAAA;
    d = 0x12345678;
    g_Val1 = 23;
    r1 = IntegerLogicalCpp(a, b, c, d);
    r2 = IntegerLogical_(a, b, c, d);
    PrintResult("Test 2", a, b, c, d, g_Val1, r1, r2);

    return 0;
}

;-------------------------------------------------
;               Ch02_02.asm
;-------------------------------------------------

; extern "C" unsigned int IntegerLogical_(unsigned int a, unsigned int b, unsigned int c, unsigned int d);

        extern g_Val1:dword                 ;external doubleword (32-bit) value

        .code
IntegerLogical_ proc

; Calculate (((a & b) | c ) ^ d) + g_Val1
        and ecx,edx                         ;ecx = a & b
        or ecx,r8d                          ;ecx = (a & b) | c
        xor ecx,r9d                         ;ecx = ((a & b) | c) ^ d
        add ecx,[g_Val1]                    ;ecx = (((a & b) | c) ^ d) + g_Val1
        mov eax,ecx                         ;eax = final result
        ret                                 ;return to caller
IntegerLogical_ endp
        end


Similar to what you saw in the first example, the declaration of assembly language function 
IntegerLogical_ uses the "C" modifier to instruct the C++ compiler not to generate a decorated name 
for this function. Omitting this modifier would result in a link error during program build. (If the "C" 
modifier is omitted from the current example, Visual C++ 2017 uses the decorated function name 
?IntegerLogical_@@YAIIIII@Z instead of IntegerLogical_. Decorated names are derived using the 
function's argument types, and these names are compiler specific.) Function IntegerLogical_ requires 
four unsigned int arguments and returns a single unsigned int result. Immediately following the 
declaration of function IntegerLogical_ is the definition of a global unsigned int variable named g_Val1. 
This variable is defined to demonstrate how to access a global value from an assembly language function. 
Like function declarations, use of the "C" modifier for g_Val1 instructs the compiler to use C-style naming 
instead of a decorated C++ name. 

The definition of function IntegerLogicalCpp follows next in the C++ source code. The reason for 
defining this function is to provide a simple method for determining whether or not the corresponding 
x86-64 assembly language function IntegerLogical_ calculates the correct result. While overkill for this 
particular example, coding complex functions using both C++ and assembly language is often helpful 
for software test and debugging purposes. The function main in Listing 2-2 includes code that calls both 
IntegerLogicalCpp and IntegerLogical_. It also calls the function PrintResult to display the results. 

In Listing 2-2 the x86-64 assembly language code for example Ch02_02 follows the C++ function main. 
The first assembly language source code statement, extern g_Val1:dword, is the MASM equivalent of the 
corresponding declaration for g_Val1 that's used in the C++ code. In this instance, the extern directive 
notifies the assembler that storage space for the variable g_Val1 is defined in another module, and the dword 
directive indicates that g_Val1 is a doubleword (or 32-bit) unsigned value. 

Similar to the example in the previous section, the arguments a, b, c, and d are passed to function 
IntegerLogical_ using registers ECX, EDX, R8D, and R9D. The and ecx,edx instruction performs a 
bitwise AND operation using the values in registers ECX and EDX, and saves the result to register ECX. The 
or ecx,r8d and xor ecx,r9d instructions carry out bitwise inclusive OR and exclusive OR operations, 
respectively. The add ecx,[g_Val1] instruction adds the contents of register ECX and the value of global 
variable g_Val1, and saves the resultant sum to register ECX. A mov eax,ecx instruction copies the final result 
to register EAX so that it can be passed back to the calling function. Here's the output for example Ch02_02. 


Test 1
a =    0x00223344 (2241348)
b =    0x00775544 (7820612)
c =    0x00555555 (5592405)
d =    0x00998877 (10061943)
val1 = 0x00000007 (7)
r1 =   0x00eedd29 (15654185)
r2 =   0x00eedd29 (15654185)

Test 2
a =    0x70987655 (1889039957)
b =    0x55555555 (1431655765)
c =    0xaaaaaaaa (2863311530)
d =    0x12345678 (305419896)
val1 = 0x00000017 (23)
r1 =   0xe88ea89e (3901663390)
r2 =   0xe88ea89e (3901663390)
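To see how the results above arise, it can help to trace the intermediate value that register ECX holds after each of the and, or, xor, and add instructions. The following is a minimal sketch under stated assumptions: the names LogicalSteps and TraceIntegerLogical are hypothetical, and the global variable is replaced by an ordinary parameter (g_val1) so that the helper is self-contained.

```cpp
#include <cstdint>

// Hypothetical helper (illustration only) that exposes the intermediate
// values produced by the and/or/xor/add sequence of IntegerLogical_.
struct LogicalSteps
{
    uint32_t t1;        // a & b
    uint32_t t2;        // (a & b) | c
    uint32_t t3;        // ((a & b) | c) ^ d
    uint32_t result;    // (((a & b) | c) ^ d) + g_val1
};

// g_val1 is passed as a parameter here instead of being read from a global.
LogicalSteps TraceIntegerLogical(uint32_t a, uint32_t b, uint32_t c, uint32_t d, uint32_t g_val1)
{
    LogicalSteps s;
    s.t1 = a & b;               // and ecx,edx
    s.t2 = s.t1 | c;            // or ecx,r8d
    s.t3 = s.t2 ^ d;            // xor ecx,r9d
    s.result = s.t3 + g_val1;   // add ecx,[g_Val1]
    return s;
}
```

For the Test 1 inputs, the trace is 0x00221144, then 0x00775555, then 0x00eedd22, and finally 0x00eedd29 after 7 is added.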


Shift Operations 


The last source code example of this section, which is similar in form to the previous two examples, 
demonstrates use of the shl (Shift Logical Left) and shr (Shift Logical Right) instructions. It also illustrates 
use of a few more frequently used instructions including cmp (Compare), ja (Jump if Above), and xchg 
(Exchange). Listing 2-3 shows the C++ and assembly language source code for example Ch02_03. 


Listing 2-3. Example Ch02_03 


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <bitset>

using namespace std;

extern "C" int IntegerShift_(unsigned int a, unsigned int count, unsigned int* a_shl, unsigned int* a_shr);

static void PrintResult(const char* s, int rc, unsigned int a, unsigned int count, unsigned int a_shl, unsigned int a_shr)
{
    bitset<32> a_bs(a);
    bitset<32> a_shl_bs(a_shl);
    bitset<32> a_shr_bs(a_shr);
    const int w = 10;
    const char nl = '\n';

    cout << s << '\n';
    cout << "count =" << setw(w) << count << nl;
    cout << "a = " << setw(w) << a << " (0b" << a_bs << ")" << nl;

    if (rc == 0)
        cout << "Invalid shift count" << nl;
    else
    {
        cout << "shl = " << setw(w) << a_shl << " (0b" << a_shl_bs << ")" << nl;
        cout << "shr = " << setw(w) << a_shr << " (0b" << a_shr_bs << ")" << nl;
    }

    cout << nl;
}

int main()
{
    int rc;
    unsigned int a, count, a_shl, a_shr;

    a = 3119;
    count = 6;
    rc = IntegerShift_(a, count, &a_shl, &a_shr);
    PrintResult("Test 1", rc, a, count, a_shl, a_shr);

    a = 0x00800080;
    count = 4;
    rc = IntegerShift_(a, count, &a_shl, &a_shr);
    PrintResult("Test 2", rc, a, count, a_shl, a_shr);

    a = 0x80000001;
    count = 31;
    rc = IntegerShift_(a, count, &a_shl, &a_shr);
    PrintResult("Test 3", rc, a, count, a_shl, a_shr);

    a = 0x55555555;
    count = 32;
    rc = IntegerShift_(a, count, &a_shl, &a_shr);
    PrintResult("Test 4", rc, a, count, a_shl, a_shr);

    return 0;
}

; extern "C" int IntegerShift_(unsigned int a, unsigned int count, unsigned int* a_shl, unsigned int* a_shr);
;
; Returns: 0 = error (count >= 32), 1 = success

        .code
IntegerShift_ proc
        xor eax,eax                         ;set return code in case of error
        cmp edx,31                          ;compare count against 31
        ja InvalidCount                     ;jump if count > 31

        xchg ecx,edx                        ;exchange contents of ecx & edx
        mov eax,edx                         ;eax = a
        shl eax,cl                          ;eax = a << count
        mov [r8],eax                        ;save result
        shr edx,cl                          ;edx = a >> count
        mov [r9],edx                        ;save result
        mov eax,1                           ;set success return code

InvalidCount:
        ret                                 ;return to caller
IntegerShift_ endp
        end


Near the top of the C++ code, the declaration of the x86 assembly language function IntegerShift_ 
is somewhat different than the previous examples in that it defines two pointer arguments. Pointers 
are used by this function since it needs to return more than one result to its calling function. The other 
minor difference is that the int return value from IntegerShift_ is used to indicate whether or not the 
value of count is valid. The remaining C++ code in Listing 2-3 exercises the assembly language function 
IntegerShift_ using a few test cases and displays results. 
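The contract of this function (two results returned through pointers plus an int status code) can be sketched in C++ as follows. This is an illustrative counterpart only: the name IntegerShiftCpp is hypothetical, and fixed-size unsigned types are used in place of unsigned int. The comparison is unsigned, which matches the cmp and ja instruction pair used in the assembly language code.

```cpp
#include <cstdint>

// Hypothetical C++ counterpart of IntegerShift_ (illustration only).
// Returns 0 if count is invalid (above 31), otherwise 1.
int IntegerShiftCpp(uint32_t a, uint32_t count, uint32_t* a_shl, uint32_t* a_shr)
{
    if (count > 31)         // cmp edx,31 / ja InvalidCount (unsigned compare)
        return 0;           // error: the output locations are left untouched

    *a_shl = a << count;    // shl eax,cl / mov [r8],eax
    *a_shr = a >> count;    // shr edx,cl / mov [r9],edx
    return 1;               // mov eax,1
}
```

Because the comparison is unsigned, a count such as 0xFFFFFFFF, which would be negative if interpreted as a signed int, is also rejected.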

The assembly language code of function IntegerShift_ starts with an xor eax,eax instruction that 
sets register EAX to zero. This is done to ensure that register EAX contains the correct return code should 
an invalid value for argument count be detected. The next instruction, cmp edx,31, compares the contents 
of register EDX, which contains count, to the constant value 31. When the processor performs a compare 
operation, it subtracts the second operand from the first operand, sets the status flags based on the results 
of this operation, and discards the result. If the value of count is above 31, the ja InvalidCount instruction 
performs a jump to the program location specified by the destination operand. If you look ahead a few 
lines, you will notice a statement with the text InvalidCount:. This text is called a label. If count > 31 is 
true, the ja InvalidCount instruction transfers program control to the first assembly language instruction 
immediately following the label InvalidCount. Note that this instruction can be on the same line or a different 
line, as shown in Listing 2-3. 

The xchg ecx,edx instruction swaps the values in registers ECX and EDX. The reason for doing this is 
that the shl and shr instructions must use register CL for the shift count. The mov eax,edx instruction copies 
the value a into register EAX, and the subsequent shl eax,cl instruction shifts this value left by the number 
of bits that's specified in register CL. The 64-bit pointer value a_shl is passed to function IntegerShift_ using 
register R8 (in 64-bit programming, all pointers are 64 bits). The mov [r8],eax instruction saves the result of 
the shift operation to the memory location that's specified by the contents of register R8. 

The subsequent shr edx,cl instruction shifts the value in register EDX (which contains argument 
value a) right by the number of bits specified in register CL. This result is then saved to the memory location 
pointed to by register R9, which contains a pointer to the memory location specified by a_shr. The shr 
instruction is used in function IntegerShift_ since argument a is declared as an unsigned int. If a were 
declared an int, the sar (Shift Arithmetic Right) instruction could be used to preserve the sign bit of the 
source operand. The mov eax,1 instruction loads EAX with the constant one to indicate that the value 
of the count argument was valid. It should be noted that the testing of count for a value above 31 was 
implemented to illustrate argument checking in an assembly language function. For shift instructions that 
use an immediate or variable bit shift count, the processor performs a masking operation that limits the shift 
count to a value between 0 and 31 when the target operand is 32 bits (the limits 0 and 63 are used for 64-bit 
operands). Here's the output for source code example Ch02_03. 


Test 1
count = 6
a = 3119 (0b00000000000000000000110000101111)
shl = 199616 (0b00000000000000110000101111000000)
shr = 48 (0b00000000000000000000000000110000)

Test 2
count = 4
a = 8388736 (0b00000000100000000000000010000000)
shl = 134219776 (0b00001000000000000000100000000000)
shr = 524296 (0b00000000000010000000000000001000)

Test 3
count = 31
a = 2147483649 (0b10000000000000000000000000000001)
shl = 2147483648 (0b10000000000000000000000000000000)
shr = 1 (0b00000000000000000000000000000001)

Test 4
count = 32
a = 1431655765 (0b01010101010101010101010101010101)
Invalid shift count
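The count masking mentioned earlier explains why Test 4 needs the explicit range check: without it, a count of 32 would be reduced to 0 by the processor and the value would come back unshifted. The sketch below emulates that masking for 32-bit operands. The function names EmulateShl32 and EmulateShr32 are hypothetical. The mask is applied explicitly because shifting a 32-bit value by 32 or more bits with the C++ shift operators is undefined behavior.

```cpp
#include <cstdint>

// Emulates the count masking that shl applies to a 32-bit operand:
// only the low-order 5 bits of the count are used.
uint32_t EmulateShl32(uint32_t a, uint32_t count)
{
    return a << (count & 31);
}

// Same masking rule for shr (logical right shift, zero fill).
uint32_t EmulateShr32(uint32_t a, uint32_t count)
{
    return a >> (count & 31);
}
```

With a = 0x55555555 and count = 32, the masked count is 0 and EmulateShl32 returns 0x55555555 unchanged, whereas IntegerShift_ reports the same inputs as an invalid shift count.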


Advanced Integer Arithmetic 


In this section, you'll learn how to perform integer multiplication and division. You'll also learn how to use 
the x86-64 assembly language instruction set to carry out integer arithmetic using operands of different 
sizes. In addition to these topics, this section introduces important programming concepts and a few 
particulars related to the Visual C++ calling convention. 

Note: The Visual C++ calling convention requirements that are described in this section and in subsequent 
chapters may be different for other high-level programming languages and operating systems. If you're reading 
this book to learn x86-64 assembly language and plan on using it with a different high-level programming 
language or operating system, you should consult the appropriate documentation for more information 
regarding the target platform's calling convention requirements. 


Multiplication and Division 


Listing 2-4 contains the source code for example Ch02_04. In this example, the function IntegerMulDiv_ 
computes the product, quotient, and remainder of two integers using the imul (Integer Multiplication) and 
idiv (Integer Division) instructions. Note that the C++ declaration of function IntegerMulDiv_ includes five 
parameters. Up to this point you've only seen function declarations with a maximum of four parameters, 
and the argument values for these parameters were passed using registers RCX, RDX, R8, and R9 or the 
low-order portion of these registers. The reason for using these registers is that they are required by the 
Visual C++ calling convention. 


Listing 2-4. Example Ch02_04 


#include "stdafx.h"
#include <iostream>

using namespace std;

extern "C" int IntegerMulDiv_(int a, int b, int* prod, int* quo, int* rem);

void PrintResult(const char* s, int rc, int a, int b, int p, int q, int r)
{
    const char nl = '\n';

    cout << s << nl;
    cout << "a = " << a << ", b = " << b << ", rc = " << rc << nl;

    if (rc != 0)
        cout << "prod = " << p << ", quo = " << q << ", rem = " << r << nl;
    else
        cout << "prod = " << p << ", quo = undefined" << ", rem = undefined" << nl;

    cout << nl;
}

int main()
{
    int rc;
    int a, b;
    int prod, quo, rem;

    a = 47;
    b = 13;
    prod = quo = rem = 0;
    rc = IntegerMulDiv_(a, b, &prod, &quo, &rem);
    PrintResult("Test 1", rc, a, b, prod, quo, rem);

    a = -291;
    b = 7;
    prod = quo = rem = 0;
    rc = IntegerMulDiv_(a, b, &prod, &quo, &rem);
    PrintResult("Test 2", rc, a, b, prod, quo, rem);

    a = 19;
    b = 0;
    prod = quo = rem = 0;
    rc = IntegerMulDiv_(a, b, &prod, &quo, &rem);
    PrintResult("Test 3", rc, a, b, prod, quo, rem);

    a = 247;
    b = 85;
    prod = quo = rem = 0;
    rc = IntegerMulDiv_(a, b, &prod, &quo, &rem);
    PrintResult("Test 4", rc, a, b, prod, quo, rem);

    return 0;
}

; extern "C" int IntegerMulDiv_(int a, int b, int* prod, int* quo, int* rem);
;
; Returns: 0 = error (divisor equals zero), 1 = success

        .code
IntegerMulDiv_ proc

; Make sure the divisor is not zero
        mov eax,edx                         ;eax = b
        or eax,eax                          ;logical OR sets status flags
        jz InvalidDivisor                   ;jump if b is zero

; Calculate product and save result
        imul eax,ecx                        ;eax = a * b
        mov [r8],eax                        ;save product

; Calculate quotient and remainder, save results
        mov r10d,edx                        ;r10d = b
        mov eax,ecx                         ;eax = a
        cdq                                 ;edx:eax contains 64-bit dividend
        idiv r10d                           ;eax = quotient, edx = remainder

        mov [r9],eax                        ;save quotient
        mov rax,[rsp+40]                    ;rax = 'rem'
        mov [rax],edx                       ;save remainder

        mov eax,1                           ;set success return code

InvalidDivisor:
        ret                                 ;return to caller
IntegerMulDiv_ endp
        end


A calling convention is a binary protocol that describes how arguments and return values are 
exchanged between two functions. As you have already seen, the Visual C++ calling convention for x86-64 
programs on Windows requires a calling function to pass the first four integer (or pointer) arguments using 
registers RCX, RDX, R8, and R9. The low-order portions of these registers are used for argument values 
smaller than 64 bits (e.g., ECX, CX, or CL for a 32-, 16-, or 8-bit integer). Any additional arguments are passed 
using the stack. The calling convention also defines additional requirements including rules for passing 
floating-point values, general-purpose and XMM register use, and stack frames. You'll learn about these 
additional requirements in Chapter 5. 
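The register assignment rule for integer and pointer arguments can be summarized as a small lookup table. The helper below is a hypothetical illustration (the name IntegerArgLocation is not from the book) and deliberately ignores floating-point arguments, which follow different rules.

```cpp
#include <string>

// Hypothetical lookup helper (illustration only): returns the register that
// carries an integer or pointer argument under the Visual C++ x64 calling
// convention, given the argument's zero-based position and its size in bits
// (8, 16, 32, or 64). Arguments at position 4 or higher go on the stack.
std::string IntegerArgLocation(int position, int size_bits)
{
    static const char* const regs[4][4] = {
        // 64-bit  32-bit  16-bit  8-bit
        { "rcx",   "ecx",  "cx",   "cl"  },
        { "rdx",   "edx",  "dx",   "dl"  },
        { "r8",    "r8d",  "r8w",  "r8b" },
        { "r9",    "r9d",  "r9w",  "r9b" },
    };

    if (position < 0)
        return "invalid";
    if (position >= 4)
        return "stack";

    switch (size_bits)
    {
        case 64: return regs[position][0];
        case 32: return regs[position][1];
        case 16: return regs[position][2];
        case 8:  return regs[position][3];
        default: return "invalid";
    }
}
```

For example, the third argument of a function is found in r8d when it is a 32-bit int and in r8 when it is a pointer.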

The C++ code in Listing 2-4 is similar to the other examples that you've already seen. It simply exercises 
some test cases and displays results. Upon entry to function IntegerMulDiv_, registers ECX, EDX, R8, and R9 
contain the argument values a, b, prod, and quo, respectively. The fifth argument rem is passed on the stack, 
as shown in Figure 2-1. Note that since prod, quo, and rem are pointers, they are passed to IntegerMulDiv_ as 
64-bit values. 


[Figure 2-1: diagram of the stack, from high memory to low memory (rem at RSP+40, R9 Home at RSP+32, 
R8 Home at RSP+24, RDX Home at RSP+16, RCX Home at RSP+8, return address at RSP), alongside the 
argument registers RCX (a), RDX (b), R8 (prod), and R9 (quo). Shaded cells indicate undefined values.] 


Figure 2-1. Argument registers and stack at entry to function IntegerMulDiv_ 

Figure 2-1 illustrates the state of the stack and the argument registers upon entry to IntegerMulDiv_ 
but prior to the execution of its first instruction. Note that the location of the fifth argument value rem is 
at memory address RSP + 40. A simple mov instruction can be used to load rem, which is a pointer, into a 
general-purpose register when it's needed. Also note in Figure 2-1 that register RSP points to the caller's 
return address on the stack. During execution of a ret instruction, the processor copies this value from the 
stack and ultimately stores it in register RIP. The ret instruction also removes the caller's return address from 
the stack by adding 8 to the value in RSP. The stack locations labeled RCX Home, RDX Home, R8 Home, and 
R9 Home are storage areas that can be used to temporarily save the corresponding argument registers. These 
areas can also be used to store other transient data. You'll learn more about the home area in Chapter 5. 
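The offsets in the stack layout described above follow a simple pattern: 8 bytes for the return address, followed by one 8-byte slot per argument position. The following sketch captures that arithmetic. The helper name ArgOffsetAtEntry is hypothetical, and the result is only valid at function entry, before the function itself changes RSP.

```cpp
// Hypothetical helper (illustration only): byte offset from RSP, at function
// entry, of the stack slot associated with the argument at zero-based
// position 'index' under the Visual C++ x64 calling convention.
// Positions 0-3 refer to the home area for RCX, RDX, R8, and R9;
// positions 4 and higher refer to arguments that are passed on the stack.
constexpr int ArgOffsetAtEntry(int index)
{
    return 8 + 8 * index;   // 8 bytes for the return address, then 8 per slot
}

static_assert(ArgOffsetAtEntry(0) == 8,  "RCX Home is at RSP + 8");
static_assert(ArgOffsetAtEntry(4) == 40, "fifth argument is at RSP + 40");
```

The same formula places a tenth argument (index 9) at RSP + 80.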

The function IntegerMulDiv_ computes and saves the product a * b. It also calculates and saves the 
quotient and remainder of a / b. Since IntegerMulDiv_ performs division using b, it makes sense to test 
the value of b to confirm that it's not equal to zero. In Listing 2-4, the mov eax,edx instruction copies b into 
register EAX. The next instruction, or eax,eax, performs a bitwise OR operation to set the status flags. If b 
is zero, the jz InvalidDivisor (Jump if Zero) instruction skips over the code that performs the division. 
Like the previous example, the function IntegerMulDiv_ uses a return value of zero to indicate an error 
condition. Since EAX already contains zero, no additional instructions are necessary. 
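The overall behavior of IntegerMulDiv_, including the zero-divisor check, can be mirrored in C++ as shown below. This is a hedged sketch rather than the book's code: the name IntegerMulDivCpp is hypothetical. One difference worth noting is that signed overflow in the C++ expression a * b is undefined behavior, whereas imul simply keeps the low-order 32 bits.

```cpp
// Hypothetical C++ counterpart of IntegerMulDiv_ (illustration only).
// Returns 0 if the divisor is zero, otherwise 1.
int IntegerMulDivCpp(int a, int b, int* prod, int* quo, int* rem)
{
    if (b == 0)         // or eax,eax / jz InvalidDivisor
        return 0;       // on error nothing is written through the pointers

    *prod = a * b;      // imul eax,ecx / mov [r8],eax
    *quo = a / b;       // cdq / idiv r10d, quotient in EAX
    *rem = a % b;       // remainder in EDX
    return 1;           // mov eax,1
}
```

C++11 and later define integer division as truncating toward zero, with the remainder taking the sign of the dividend. This is the same rule that idiv follows, so -291 divided by 7 gives a quotient of -41 and a remainder of -4 in both versions.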

The next instruction, imul eax,ecx, computes a * b. The mov [r8],eax instruction that follows saves the 
product to the memory location specified by R8, which contains the pointer prod. The x86-64 instruction set 
supports several different forms of the imul instruction. The two-operand form that's used here actually 
computes a 64-bit result (recall that the product of two 32-bit integers can require up to 64 bits) but saves 
only the lower 32 bits in the destination operand. The single-operand form of imul can be used when a 
non-truncated result is required. 
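The difference between the full product and the truncated product can be demonstrated with a short C++ sketch. The function names FullProduct and LowProductBits are hypothetical. The first function is comparable to the 64-bit result that the single-operand form of imul leaves in register pair EDX:EAX, and the second returns the bit pattern that the two-operand form leaves in EAX.

```cpp
#include <cstdint>

// Full 64-bit product of two 32-bit signed integers.
int64_t FullProduct(int32_t a, int32_t b)
{
    return static_cast<int64_t>(a) * static_cast<int64_t>(b);
}

// Low-order 32 bits of the product (returned as an unsigned bit pattern).
// The conversion from int64_t to uint32_t is well defined: it keeps the
// value modulo 2^32.
uint32_t LowProductBits(int32_t a, int32_t b)
{
    return static_cast<uint32_t>(FullProduct(a, b));
}
```

For a = b = 100000, the full product is 10000000000, but only 1410065408 (the product modulo 2^32) fits in the low-order 32 bits.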

Integer division occurs next. The mov r10d,edx and mov eax,ecx instructions load registers R10D 
and EAX with argument values b and a, respectively. Before performing the division operation, the 32-bit 
dividend in EAX must be sign-extended to 64 bits and this is carried out by the cdq (Convert Doubleword 
to Quadword) instruction. Following execution of cdq, register pair EDX:EAX contains the 64-bit dividend 
and register R10D contains the 32-bit divisor. The idiv r10d instruction divides the contents of register 
pair EDX:EAX by the value in R10D. After execution of the idiv instruction, the 32-bit quotient and 32-bit 
remainder reside in registers EAX and EDX, respectively. The subsequent mov [r9],eax instruction saves the 
quotient to the memory location specified by quo. In order to save the remainder, the pointer rem must be 
obtained from the stack and this is achieved using a mov rax,[rsp+40] instruction. The mov [rax],edx 
instruction saves the remainder to the memory location specified by rem. The output for example Ch02_04 
is the following: 


Test 1
a = 47, b = 13, rc = 1
prod = 611, quo = 3, rem = 8

Test 2
a = -291, b = 7, rc = 1
prod = -2037, quo = -41, rem = -4

Test 3
a = 19, b = 0, rc = 0
prod = 0, quo = undefined, rem = undefined

Test 4
a = 247, b = 85, rc = 1
prod = 20995, quo = 2, rem = 77


Calculations Using Mixed Types 


In many programs, it is often necessary to perform arithmetic calculations using multiple integer types. 
Consider the C++ expression a = b * c * d * e, where a, b, c, d, and e are declared as long long, long 
long, int, short, and char. Calculating the correct result requires proper promotion of the smaller-sized 
integers into larger ones. In the next example, you'll learn a few techniques that can be used to carry out 
integer promotions in an assembly language function. You'll also learn how to access integer argument values 
of various sizes that are stored on the stack. Listing 2-5 contains the source code for example Ch02_05. 


Listing 2-5. Example Ch02_05 


#include "stdafx.h"
#include <iostream>
#include <cstdint>

using namespace std;

extern "C" int64_t IntegerMul_(int8_t a, int16_t b, int32_t c, int64_t d, int8_t e, int16_t f, int32_t g, int64_t h);

extern "C" int UnsignedIntegerDiv_(uint8_t a, uint16_t b, uint32_t c, uint64_t d, uint8_t e, uint16_t f, uint32_t g, uint64_t h, uint64_t* quo, uint64_t* rem);

void IntegerMul(void)
{
    int8_t a = 2;
    int16_t b = -3;
    int32_t c = 8;
    int64_t d = 4;
    int8_t e = 3;
    int16_t f = -7;
    int32_t g = -5;
    int64_t h = 10;

    // Calculate a * b * c * d * e * f * g * h
    int64_t prod1 = a * b * c * d * e * f * g * h;
    int64_t prod2 = IntegerMul_(a, b, c, d, e, f, g, h);

    cout << "\nResults for IntegerMul\n";
    cout << "a = " << (int)a << ", b = " << b << ", c = " << c << ' ';
    cout << "d = " << d << ", e = " << (int)e << ", f = " << f << ' ';
    cout << "g = " << g << ", h = " << h << '\n';
    cout << "prod1 = " << prod1 << '\n';
    cout << "prod2 = " << prod2 << '\n';
}

void UnsignedIntegerDiv(void)
{
    uint8_t a = 12;
    uint16_t b = 17;
    uint32_t c = 71000000;
    uint64_t d = 90000000000;
    uint8_t e = 101;
    uint16_t f = 37;
    uint32_t g = 25;
    uint64_t h = 5;
    uint64_t quo1, rem1;
    uint64_t quo2, rem2;

    quo1 = (a + b + c + d) / (e + f + g + h);
    rem1 = (a + b + c + d) % (e + f + g + h);
    UnsignedIntegerDiv_(a, b, c, d, e, f, g, h, &quo2, &rem2);

    cout << "\nResults for UnsignedIntegerDiv\n";
    cout << "a = " << (unsigned)a << ", b = " << b << ", c = " << c << ' ';
    cout << "d = " << d << ", e = " << (unsigned)e << ", f = " << f << ' ';
    cout << "g = " << g << ", h = " << h << '\n';
    cout << "quo1 = " << quo1 << ", rem1 = " << rem1 << '\n';
    cout << "quo2 = " << quo2 << ", rem2 = " << rem2 << '\n';
}

int main()
{
    IntegerMul();
    UnsignedIntegerDiv();
    return 0;
}

;-------------------------------------------------
;               Ch02_05.asm
;-------------------------------------------------

; extern "C" int64_t IntegerMul_(int8_t a, int16_t b, int32_t c, int64_t d, int8_t e, int16_t f, int32_t g, int64_t h);

        .code
IntegerMul_ proc

; Calculate a * b * c * d
        movsx rax,cl                        ;rax = sign_extend(a)
        movsx rdx,dx                        ;rdx = sign_extend(b)
        imul rax,rdx                        ;rax = a * b
        movsxd rcx,r8d                      ;rcx = sign_extend(c)
        imul rcx,r9                         ;rcx = c * d
        imul rax,rcx                        ;rax = a * b * c * d

; Calculate e * f * g * h
        movsx rcx,byte ptr [rsp+40]         ;rcx = sign_extend(e)
        movsx rdx,word ptr [rsp+48]         ;rdx = sign_extend(f)
        imul rcx,rdx                        ;rcx = e * f
        movsxd rdx,dword ptr [rsp+56]       ;rdx = sign_extend(g)
        imul rdx,qword ptr [rsp+64]         ;rdx = g * h
        imul rcx,rdx                        ;rcx = e * f * g * h

; Compute the final product
        imul rax,rcx                        ;rax = final product

        ret
IntegerMul_ endp

; extern "C" int UnsignedIntegerDiv_(uint8_t a, uint16_t b, uint32_t c, uint64_t d, uint8_t e, uint16_t f, uint32_t g, uint64_t h, uint64_t* quo, uint64_t* rem);

UnsignedIntegerDiv_ proc

; Calculate a + b + c + d
        movzx rax,cl                        ;rax = zero_extend(a)
        movzx rdx,dx                        ;rdx = zero_extend(b)
        add rax,rdx                         ;rax = a + b
        mov r8d,r8d                         ;r8 = zero_extend(c)
        add r8,r9                           ;r8 = c + d
        add rax,r8                          ;rax = a + b + c + d
        xor rdx,rdx                         ;rdx:rax = a + b + c + d

; Calculate e + f + g + h
        movzx r8,byte ptr [rsp+40]          ;r8 = zero_extend(e)
        movzx r9,word ptr [rsp+48]          ;r9 = zero_extend(f)
        add r8,r9                           ;r8 = e + f
        mov r10d,[rsp+56]                   ;r10 = zero_extend(g)
        add r10,[rsp+64]                    ;r10 = g + h
        add r8,r10                          ;r8 = e + f + g + h
        jnz DivOK                           ;jump if divisor is not zero
        xor eax,eax                         ;set error return code
        jmp Done

; Calculate (a + b + c + d) / (e + f + g + h)
DivOK:  div r8                              ;unsigned divide rdx:rax / r8
        mov rcx,[rsp+72]
        mov [rcx],rax                       ;save quotient
        mov rcx,[rsp+80]
        mov [rcx],rdx                       ;save remainder
        mov eax,1                           ;set success return code

Done:   ret
UnsignedIntegerDiv_ endp
        end

The assembly language function IntegerMul_ calculates the product of eight signed integers ranging 
in size from 8 bits to 64 bits. The C++ declaration for this function uses the fixed-sized integer types that 
are declared in the header file <cstdint> instead of the normal long long, int, short, and char. Some 
assembly language programmers (including me) prefer to use fixed-sized integer types for assembly 
language function declarations since it emphasizes the exact size of the argument. The declaration of 
function UnsignedIntegerDiv_, which demonstrates how to perform unsigned integer division, also uses 
fixed-size integer types. Figure 2-2 illustrates the contents of the stack at entry to IntegerMul_. 


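[Figure 2-2: diagram of the stack, from high memory to low memory (h at RSP+64, g at RSP+56, f at RSP+48, 
e at RSP+40, R9 Home, R8 Home, RDX Home, RCX Home, return address at RSP), alongside the argument 
registers RCX (a), RDX (b), R8 (c), and R9 (d). Shaded cells indicate undefined values.] 


Figure 2-2. Argument registers and stack at entry to function IntegerMul_ 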


The first instruction of IntegerMul_, movsx rax,cl (Move with Sign Extension), sign-extends a copy 
of the 8-bit integer value a that's in register CL to 64 bits and saves this value in register RAX. Note that the 
original value in register CL is unaltered by this operation. Another movsx instruction follows that saves a 
64-bit sign-extended copy of the 16-bit value b to RDX. Like the previous movsx instruction, the source operand 
is not modified by this operation. An imul rax,rdx instruction computes the product of a and b. The 
two-operand form of the imul instruction that's used here saves only the lower 64 bits of the 128-bit product 
in the destination operand RAX. The next instruction, movsxd rcx,r8d, sign-extends the 32-bit operand c to 
64 bits. Note that a different instruction mnemonic is required when sign extending a 32-bit integer to 64 bits. 
The next two imul instructions compute the intermediate product a * b * c * d. 
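In C++, the effect of movsx and movsxd corresponds to converting a narrower signed value to int64_t before it is used. The sketch below is illustrative only: SignExtend8 and IntegerMulCpp are hypothetical names, and the grouping of the multiplications simply follows the order used in the assembly language listing.

```cpp
#include <cstdint>

// Comparable to movsx rax,cl: widen an 8-bit signed value to 64 bits,
// replicating the sign bit into the upper 56 bits.
int64_t SignExtend8(int8_t x)
{
    return static_cast<int64_t>(x);
}

// Hypothetical C++ counterpart of IntegerMul_ (illustration only). Every
// operand is widened to 64 bits before it takes part in a multiplication.
int64_t IntegerMulCpp(int8_t a, int16_t b, int32_t c, int64_t d,
                      int8_t e, int16_t f, int32_t g, int64_t h)
{
    int64_t abcd = (static_cast<int64_t>(a) * static_cast<int64_t>(b)) *
                   (static_cast<int64_t>(c) * d);
    int64_t efgh = (static_cast<int64_t>(e) * static_cast<int64_t>(f)) *
                   (static_cast<int64_t>(g) * h);
    return abcd * efgh;
}
```

The explicit casts matter for larger inputs. In an unparenthesized C++ expression such as a * b * c * d, the subexpression a * b * c is evaluated using type int, so it can overflow before the 64-bit operand d is ever reached. Widening each operand first avoids that.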

Calculation of the second intermediate product e * f * g * h is carried out next. All of these
argument values were passed using the stack as shown in Figure 2-2. The movsx rcx,byte ptr [rsp+40]
instruction sign-extends a copy of the 8-bit argument value e that's located on the stack and saves the result to
register RCX. The text byte ptr is a MASM directive that acts like a C++ cast operator and conveys to the
assembler the size of the source operand. Without the byte ptr directive, the movsx instruction is ambiguous since
several different sizes are possible for the source operand. The argument value f is loaded next using a movsx
rdx,word ptr [rsp+48] instruction. Following calculation of the intermediate product e * f using an imul
instruction, a movsxd rdx,dword ptr [rsp+56] instruction loads a sign-extended copy of g into RDX. This is
followed by an imul rdx,qword ptr [rsp+64] instruction that calculates the intermediate product g * h.
Use of the qword ptr directive is optional here; size directives are often used in this manner to improve
program readability. The final two imul instructions calculate the final product.
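
For comparison, the same calculation can be sketched in C++. This is only an illustrative model, not the book's comparison code: the names IntegerMulRef and CheckIntegerMulRef are hypothetical, and the argument sizes (8, 16, 32, and 64 bits, repeated for e through h) are taken from the register and stack operands described above. Each static_cast to int64_t plays the role of a movsx or movsxd instruction.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical C++ model of IntegerMul (name and layout are assumptions).
// Widening the first operand to 64 bits forces every multiplication to be
// carried out on 64-bit values, like the sign extensions in the assembly code.
int64_t IntegerMulRef(int8_t a, int16_t b, int32_t c, int64_t d,
                      int8_t e, int16_t f, int32_t g, int64_t h)
{
    int64_t abcd = static_cast<int64_t>(a) * b * c * d;   // first intermediate product
    int64_t efgh = static_cast<int64_t>(e) * f * g * h;   // second intermediate product

    // The two-operand imul keeps only the low 64 bits of the product. In C++,
    // signed overflow is undefined, so this sketch assumes the result fits.
    return abcd * efgh;
}

// Usage example with the argument values shown in the program output.
void CheckIntegerMulRef()
{
    assert(IntegerMulRef(2, -3, 8, 4, 3, -7, -5, 10) == -201600);
}
```

The check uses the argument values that appear in the output for example Ch02_05: (2 * -3 * 8 * 4) * (3 * -7 * -5 * 10) = -192 * 1050 = -201600.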
Figure 2-3 illustrates the contents of the stack at entry to the function UnsignedIntegerDiv. This
function calculates the quotient and remainder of the expression (a + b + c + d) / (e + f + g + h). As
implied by its name, UnsignedIntegerDiv uses unsigned integer arguments of different sizes and performs
unsigned integer division. In order to calculate the correct results, the smaller-sized arguments must
be zero-extended prior to any arithmetic operations. The movzx rax,cl and movzx rdx,dx instructions
load zero-extended copies of argument values a and b into their respective destination registers. The add
rax,rdx instruction that follows calculates the intermediate sum a + b. At first glance, the mov r8d,r8d
instruction that comes next seems superfluous, but it's actually performing a necessary operation. When an
x86 processor is running in 64-bit mode, instructions that employ 32-bit operands produce 32-bit results.
If the destination operand is a 32-bit register, the high-order 32 bits (i.e., bits 63 - 32) of the corresponding
64-bit register are set to zero. The mov r8d,r8d instruction is used here to zero-extend the 32-bit value c
that's already loaded in register R8D to a 64-bit value in R8. The next two add instructions calculate the
intermediate sum a + b + c + d and save the result to RAX. The ensuing xor rdx,rdx instruction yields a
128-bit zero-extended dividend value that's stored in register pair RDX:RAX.
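
The difference between zero extension and sign extension is easy to see in a few lines of C++. The following is a minimal sketch; the function names are hypothetical and chosen only for this illustration. Converting an unsigned 32-bit value to 64 bits fills the upper bits with zeros (as movzx, or a write to a 32-bit register in 64-bit mode, does), while converting a signed value replicates the sign bit (as movsxd does).

```cpp
#include <cassert>
#include <cstdint>

// Zero extension: the upper 32 bits of the result are always zero.
uint64_t ZeroExtend32(uint32_t x)
{
    return static_cast<uint64_t>(x);
}

// Sign extension: the upper 32 bits are copies of bit 31 of the source.
int64_t SignExtend32(int32_t x)
{
    return static_cast<int64_t>(x);
}

// Usage example: a 32-bit pattern of all ones widens differently depending
// on whether it is treated as unsigned or signed.
void CheckExtensions()
{
    assert(ZeroExtend32(0xFFFFFFFFu) == 0x00000000FFFFFFFFull);
    assert(SignExtend32(-1) == -1);
    assert(static_cast<uint64_t>(SignExtend32(-1)) == 0xFFFFFFFFFFFFFFFFull);
}
```

Using the wrong kind of extension on an unsigned argument would turn any value with its top bit set into a very large 64-bit number, which is why UnsignedIntegerDiv zero-extends its smaller arguments before adding them.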


[Figure 2-3 image omitted: the stack down to low memory (stack arguments, register home areas, return address) beside the argument registers, with undefined regions marked.]

Figure 2-3. Argument registers and stack at entry to function UnsignedIntegerDiv_


A similar sequence of instructions is used to calculate the intermediate sum e + f + g + h, with the
main difference being that these arguments are loaded from the stack. This value is then tested to see if
it's equal to zero since it will be used as the divisor. If the divisor is not zero, a div r8 instruction performs
unsigned integer division using register pair RDX:RAX as the dividend and register R8 as the divisor. The
resulting quotient (RAX) and remainder (RDX) are then saved to the memory locations specified by the
pointers quo and rem, which were passed on the stack. Here's the output for example Ch02_05.


Results for IntegerMul
a = 2, b = -3, c = 8 d = 4, e = 3, f = -7 g = -5, h = 10
prod1 = -201600
prod2 = -201600

Results for UnsignedIntegerDiv
a = 12, b = 17, c = 71000000 d = 90000000000, e = 101, f = 37 g = 25, h = 5
quo1 = 536136904, rem1 = 157
quo2 = 536136904, rem2 = 157
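
The division results can be cross-checked with a short C++ sketch. The function below is a hypothetical model rather than the book's code: its name, its bool return value, and the argument sizes chosen for e through h are assumptions made for this illustration. The conversions to uint64_t correspond to the zero extensions performed in the assembly language function, and the divisor test mirrors the check that precedes the div instruction.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical C++ model of UnsignedIntegerDiv. Returns false (an assumed
// convention) when the divisor is zero; otherwise stores quotient and remainder.
bool UnsignedIntegerDivRef(uint8_t a, uint16_t b, uint32_t c, uint64_t d,
                           uint8_t e, uint16_t f, uint32_t g, uint64_t h,
                           uint64_t* quo, uint64_t* rem)
{
    // Widen first so that every addition is a 64-bit unsigned addition.
    uint64_t dividend = static_cast<uint64_t>(a) + b + c + d;
    uint64_t divisor = static_cast<uint64_t>(e) + f + g + h;

    if (divisor == 0)
        return false;               // never divide by zero

    *quo = dividend / divisor;      // div leaves the quotient in RAX
    *rem = dividend % divisor;      // and the remainder in RDX
    return true;
}

// Usage example with the argument values shown in the program output.
void CheckUnsignedIntegerDivRef()
{
    uint64_t quo = 0, rem = 0;
    bool ok = UnsignedIntegerDivRef(12, 17, 71000000, 90000000000ull,
                                    101, 37, 25, 5, &quo, &rem);
    assert(ok && quo == 536136904ull && rem == 157ull);
}
```

With the values above, the dividend is 90071000029 and the divisor is 168, which gives the quotient 536136904 and remainder 157 shown in the output.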


Memory Addressing and Condition Codes 


Thus far the source code examples of this chapter have primarily illustrated how to use basic arithmetic and 
logical instructions. In this section, you'll learn more about the x86's memory addressing modes. You'll also 
examine sample code that demonstrates how to exploit some of the x86's condition-code based instructions. 


Memory Addressing Modes 


You learned in Chapter 1 that the x86-64 instruction set supports a variety of addressing modes that can be 
used to reference an operand in memory. In this section, you'll examine an assembly language function that 
illustrates how to use some of these modes. You'll also learn how to initialize an assembly language lookup 
table and use assembly language global variables in a C++ function. Listing 2-6 shows the source code for 
example Ch02_06.


Listing 2-6. Example Ch02_06


#include "stdafx.h"
#include <iostream>
#include <iomanip>

using namespace std;

extern "C" int NumFibVals, FibValsSum;
extern "C" int MemoryAddressing(int i, int* v1, int* v2, int* v3, int* v4);

int main()
{
    const int w = 5;
    const char nl = '\n';
    const char* delim = ", ";

    FibValsSum = 0;

    for (int i = -1; i < NumFibVals + 1; i++)
    {
        int v1 = -1, v2 = -1, v3 = -1, v4 = -1;
        int rc = MemoryAddressing(i, &v1, &v2, &v3, &v4);

        cout << "i = " << setw(w - 1) << i << delim;
        cout << "rc = " << setw(w - 1) << rc << delim;
        cout << "v1 = " << setw(w) << v1 << delim;
        cout << "v2 = " << setw(w) << v2 << delim;
        cout << "v3 = " << setw(w) << v3 << delim;
        cout << "v4 = " << setw(w) << v4 << delim;
        cout << nl;
    }

    cout << "FibValsSum = " << FibValsSum << nl;
    return 0;
}
;-------------------------------------------------
;               Ch02_06.asm
;-------------------------------------------------

; Simple lookup table (.const section data is read only)

            .const
FibVals     dword 0, 1, 1, 2, 3, 5, 8, 13
            dword 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597

NumFibVals  dword ($ - FibVals) / sizeof dword
            public NumFibVals

; Data section (data is read/write)

            .data
FibValsSum  dword ?                     ;value to demo RIP-relative addressing
            public FibValsSum

; extern "C" int MemoryAddressing(int i, int* v1, int* v2, int* v3, int* v4);
;
; Returns:  0 = error (invalid table index), 1 = success

        .code
MemoryAddressing proc

; Make sure 'i' is valid
        cmp ecx,0
        jl InvalidIndex                 ;jump if i < 0
        cmp ecx,[NumFibVals]
        jge InvalidIndex                ;jump if i >= NumFibVals

; Sign extend i for use in address calculations
        movsxd rcx,ecx                  ;sign extend i
        mov [rsp+8],rcx                 ;save copy of i (in rcx home area)

; Example #1 - base register
        mov r11,offset FibVals          ;r11 = FibVals
        shl rcx,2                       ;rcx = i * 4
        add r11,rcx                     ;r11 = FibVals + i * 4
        mov eax,[r11]                   ;eax = FibVals[i]
        mov [rdx],eax                   ;save to v1

; Example #2 - base register + index register
        mov r11,offset FibVals          ;r11 = FibVals
        mov rcx,[rsp+8]                 ;rcx = i
        shl rcx,2                       ;rcx = i * 4
        mov eax,[r11+rcx]               ;eax = FibVals[i]
        mov [r8],eax                    ;save to v2

; Example #3 - base register + index register * scale factor
        mov r11,offset FibVals          ;r11 = FibVals
        mov rcx,[rsp+8]                 ;rcx = i
        mov eax,[r11+rcx*4]             ;eax = FibVals[i]
        mov [r9],eax                    ;save to v3

; Example #4 - base register + index register * scale factor + disp
        mov r11,offset FibVals-42       ;r11 = FibVals - 42
        mov rcx,[rsp+8]                 ;rcx = i
        mov eax,[r11+rcx*4+42]          ;eax = FibVals[i]
        mov r10,[rsp+40]                ;r10 = ptr to v4
        mov [r10],eax                   ;save to v4

; Example #5 - RIP relative
        add [FibValsSum],eax            ;update sum
        mov eax,1                       ;set success return code
        ret

InvalidIndex:
        xor eax,eax                     ;set error return code
        ret

MemoryAddressing endp
        end


Toward the top of the C++ code are the requisite declaration statements for this example. Earlier in 
this chapter you learned how to reference a C++ global variable in an assembly language function. In this 
example, the opposite is illustrated. Storage space for the variables NumFibVals and FibValsSum is defined 
in the assembly language code, and these variables are referenced in the function main. 

In the assembly language function MemoryAddressing, argument i is employed as an index into an array
(or lookup table) of constant integers, while the four pointer arguments are used to save values loaded from
the lookup table using different addressing modes. Near the top of the assembly language code in Listing 2-6 is
a .const directive, which defines a memory block that contains read-only data. Immediately following the
.const directive, a lookup table named FibVals is defined. This table contains 18 doubleword integer values.
The text dword is an assembler directive that is used to allocate storage space and optionally initialize a
doubleword value (the text dd can also be used as a synonym for dword).

The line NumFibVals dword ($ - FibVals) / sizeof dword allocates storage space for a single 
doubleword value and initializes it with the number of doubleword elements in the lookup table FibVals. 
The $ character is an assembler symbol that equals the current value of the location counter (or offset from 
the beginning of the current memory block). Subtracting the offset of FibVals from $ yields the size of the
table in bytes. Dividing this result by the size in bytes of a doubleword value generates the correct number of 
elements. These statements emulate a commonly-used technique in C++ to define and initialize a variable 
with the number of elements in an array: 


const int Values[] = {10, 20, 30, 40, 50};
const int NumValues = sizeof(Values) / sizeof(int);


The final line of the .const section declares NumFibVals as a public symbol in order to enable its
use in main. The .data directive denotes the start of a memory block that contains modifiable data. The
FibValsSum dword ? statement defines an uninitialized doubleword value, and the subsequent public
statement makes it globally accessible.

Let's now look at the assembly language code for MemoryAddressing. Upon entry into the function, the
argument i is checked for validity since it will be used as an index into the lookup table FibVals. The cmp
ecx,0 instruction compares the contents of ECX, which contains i, to the immediate value 0. As discussed
earlier in this chapter, the processor carries out this comparison by subtracting the source operand from
the destination operand. It then sets the status flags based on the result of the subtraction (the result is not
saved to the destination operand). If the condition ecx < 0 is true, program control will be transferred to
the location specified by the jl (Jump if Less) instruction. A similar sequence of instructions is used to
determine if the value of i is too large. The cmp ecx,[NumFibVals] instruction compares ECX against the
number of elements in the lookup table. If ecx >= [NumFibVals] is true, a jump is performed to the target
location specified by the jge (Jump if Greater or Equal) instruction.

Immediately following the validation of i, a movsxd rcx,ecx instruction sign-extends the table index value
to 64 bits. Sign-extending or zero-extending a 32-bit integer to a 64-bit integer is often necessary when using an
addressing mode that employs an index register, as you'll soon see. The subsequent mov [rsp+8],rcx instruction
saves a copy of the sign-extended table index value to the RCX home area on the stack and is done primarily to
exemplify use of the stack home area.

The remaining instructions of MemoryAddressing illustrate accessing items in the lookup table using
various memory addressing modes. The first example uses a single base register to read an item from the
table. In order to use a single base register, the function must explicitly calculate the address of the i-th table
element, which is achieved by adding the offset (or starting address) of FibVals and the value i * 4. The
mov r11,offset FibVals instruction loads R11 with the correct table offset value. This is followed by a shl
rcx,2 instruction that determines the offset of the i-th item relative to the start of the lookup table. An add
r11,rcx instruction calculates the final address. Once this is complete, the specified table value is read using
a mov eax,[r11] instruction. It is then saved to the memory location specified by the argument v1.

In the second example, the table value is read using BaseReg+IndexReg memory addressing. This
example is similar to the first one except that the processor computes the final effective address during
execution of the mov eax,[r11+rcx] instruction. Note that recalculation of the lookup table element offset
using the mov rcx,[rsp+8] and shl rcx,2 instructions is unnecessary here but included to illustrate use of
the stack home area.

The third example demonstrates use of BaseReg+IndexReg*ScaleFactor memory addressing. In
this example, the offset of FibVals and the value i are loaded into registers R11 and RCX, respectively.
The correct table value is loaded into EAX using a mov eax,[r11+rcx*4] instruction. In the fourth (and
somewhat contrived) example, BaseReg+IndexReg*ScaleFactor+Disp memory addressing is demonstrated.
Here R11 is loaded with the offset of FibVals minus 42, and the displacement of 42 in the
mov eax,[r11+rcx*4+42] instruction cancels this bias. The fifth and final memory addressing mode example
uses an add [FibValsSum],eax instruction to demonstrate RIP-relative addressing. This instruction, which
uses a memory location as a destination operand, updates a running sum that is ultimately displayed by the
C++ code.
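
The effective address arithmetic behind the first four examples can be modeled in C++ using integer addresses. This is a rough sketch only: the names FibTable, Load32, MemoryAddressingModel, and CheckMemoryAddressingModel are hypothetical, and the code imitates what the processor computes rather than how a compiler would translate an ordinary array access.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copy of the lookup table values from Listing 2-6 (names are assumptions).
static const int32_t FibTable[] = { 0, 1, 1, 2, 3, 5, 8, 13,
                                    21, 34, 55, 89, 144, 233, 377, 610, 987, 1597 };
static const int NumFibTable = static_cast<int>(sizeof(FibTable) / sizeof(FibTable[0]));

// Stand-in for a 32-bit mov load from an explicit byte address.
static int32_t Load32(uintptr_t address)
{
    int32_t value;
    std::memcpy(&value, reinterpret_cast<const void*>(address), sizeof(value));
    return value;
}

// Returns 0 for an invalid index and 1 on success, like MemoryAddressing.
int MemoryAddressingModel(int i, int32_t* v1, int32_t* v2, int32_t* v3, int32_t* v4)
{
    if (i < 0 || i >= NumFibTable)
        return 0;

    const uintptr_t base = reinterpret_cast<uintptr_t>(FibTable);
    const uintptr_t index = static_cast<uintptr_t>(i);

    // Example #1: the full address is computed in software, then used alone.
    uintptr_t address = base;
    address += index << 2;
    *v1 = Load32(address);

    // Example #2: base + index, where the index was scaled beforehand.
    const uintptr_t scaled = index << 2;
    *v2 = Load32(base + scaled);

    // Example #3: base + index * scale factor.
    *v3 = Load32(base + index * 4);

    // Example #4: base + index * scale factor + displacement. The base is
    // biased by -42, and the displacement of 42 cancels the bias.
    *v4 = Load32((base - 42) + index * 4 + 42);

    return 1;
}

// Usage example: all four calculations must select the same table element.
void CheckMemoryAddressingModel()
{
    int32_t v1 = -1, v2 = -1, v3 = -1, v4 = -1;
    assert(MemoryAddressingModel(7, &v1, &v2, &v3, &v4) == 1);
    assert(v1 == 13 && v2 == 13 && v3 == 13 && v4 == 13);
    assert(MemoryAddressingModel(18, &v1, &v2, &v3, &v4) == 0);
}
```

All four forms produce the same address; they differ only in how much of the arithmetic is carried out by explicit instructions and how much is performed by the processor's address generation logic.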
The function main that's shown in Listing 2-6 contains a simple looping construct that exercises the
function MemoryAddressing, including test cases with an invalid index. Note that the for loop uses the
variable NumFibVals, which was defined as a public symbol in the assembly language file. The output for
the sample program Ch02_06 is shown here.


i =   -1, rc =    0, v1 =    -1, v2 =    -1, v3 =    -1, v4 =    -1,
i =    0, rc =    1, v1 =     0, v2 =     0, v3 =     0, v4 =     0,
i =    1, rc =    1, v1 =     1, v2 =     1, v3 =     1, v4 =     1,
i =    2, rc =    1, v1 =     1, v2 =     1, v3 =     1, v4 =     1,
i =    3, rc =    1, v1 =     2, v2 =     2, v3 =     2, v4 =     2,
i =    4, rc =    1, v1 =     3, v2 =     3, v3 =     3, v4 =     3,
i =    5, rc =    1, v1 =     5, v2 =     5, v3 =     5, v4 =     5,
i =    6, rc =    1, v1 =     8, v2 =     8, v3 =     8, v4 =     8,
i =    7, rc =    1, v1 =    13, v2 =    13, v3 =    13, v4 =    13,
i =    8, rc =    1, v1 =    21, v2 =    21, v3 =    21, v4 =    21,
i =    9, rc =    1, v1 =    34, v2 =    34, v3 =    34, v4 =    34,
i =   10, rc =    1, v1 =    55, v2 =    55, v3 =    55, v4 =    55,
i =   11, rc =    1, v1 =    89, v2 =    89, v3 =    89, v4 =    89,
i =   12, rc =    1, v1 =   144, v2 =   144, v3 =   144, v4 =   144,
i =   13, rc =    1, v1 =   233, v2 =   233, v3 =   233, v4 =   233,
i =   14, rc =    1, v1 =   377, v2 =   377, v3 =   377, v4 =   377,
i =   15, rc =    1, v1 =   610, v2 =   610, v3 =   610, v4 =   610,
i =   16, rc =    1, v1 =   987, v2 =   987, v3 =   987, v4 =   987,
i =   17, rc =    1, v1 =  1597, v2 =  1597, v3 =  1597, v4 =  1597,
i =   18, rc =    0, v1 =    -1, v2 =    -1, v3 =    -1, v4 =    -1,
FibValsSum = 4180


Given the multiple addressing modes that are available on an x86 processor, you might wonder which 
mode should be used. The answer to this question depends on a number of factors, including register 
availability, the number of times an instruction (or sequence of instructions) is expected to execute, 
instruction ordering, and memory space vs. execution time tradeoffs. Hardware features such as the 
processor's underlying microarchitecture and cache sizes also need to be considered. 

When coding an x86 assembly language function, one suggested guideline is to favor simple (a single 
base register or displacement) rather than complex (multiple registers) memory addressing. The drawback 
of this approach is that the simpler forms generally require the programmer to code longer instruction 
sequences and may consume more code space. The use of a simple form also may be imprudent if extra 
instructions are needed to preserve non-volatile registers on the stack (non-volatile registers are explained 
in Chapter 3). Chapter 15 considers in greater detail some of the issues and tradeoffs that can affect the 
efficiency of assembly language code. 


Condition Codes 


The final sample program of this chapter expounds on how to use the x86's conditional instructions jcc 
(Conditional Jump) and cmovcc (Conditional Move). As you have already seen in a few of this chapter's 
source code examples, the execution of a conditional instruction is contingent on its specified condition 
code and the state of one or more status flags. The source code example Ch02_07, which is shown in
Listing 2-7, demonstrates a few more use cases for the previously-mentioned instructions. 
Listing 2-7. Example Ch02_07


#include "stdafx.h"
#include <iostream>
#include <iomanip>

using namespace std;

extern "C" int SignedMinA(int a, int b, int c);
extern "C" int SignedMaxA(int a, int b, int c);
extern "C" int SignedMinB(int a, int b, int c);
extern "C" int SignedMaxB(int a, int b, int c);

void PrintResult(const char* s1, int a, int b, int c, int result)
{
    const int w = 4;

    cout << s1 << "(";
    cout << setw(w) << a << ", ";
    cout << setw(w) << b << ", ";
    cout << setw(w) << c << ") = ";
    cout << setw(w) << result << '\n';
}

int main()
{
    int a, b, c;
    int smin_a, smax_a, smin_b, smax_b;

    // SignedMin examples
    a = 2; b = 15; c = 8;
    smin_a = SignedMinA(a, b, c);
    smin_b = SignedMinB(a, b, c);
    PrintResult("SignedMinA", a, b, c, smin_a);
    PrintResult("SignedMinB", a, b, c, smin_b);
    cout << '\n';

    a = -3; b = -22; c = 28;
    smin_a = SignedMinA(a, b, c);
    smin_b = SignedMinB(a, b, c);
    PrintResult("SignedMinA", a, b, c, smin_a);
    PrintResult("SignedMinB", a, b, c, smin_b);
    cout << '\n';

    a = 17; b = 37; c = -11;
    smin_a = SignedMinA(a, b, c);
    smin_b = SignedMinB(a, b, c);
    PrintResult("SignedMinA", a, b, c, smin_a);
    PrintResult("SignedMinB", a, b, c, smin_b);
    cout << '\n';

    // SignedMax examples
    a = 10; b = 5; c = 3;
    smax_a = SignedMaxA(a, b, c);
    smax_b = SignedMaxB(a, b, c);
    PrintResult("SignedMaxA", a, b, c, smax_a);
    PrintResult("SignedMaxB", a, b, c, smax_b);
    cout << '\n';

    a = -3; b = 28; c = 15;
    smax_a = SignedMaxA(a, b, c);
    smax_b = SignedMaxB(a, b, c);
    PrintResult("SignedMaxA", a, b, c, smax_a);
    PrintResult("SignedMaxB", a, b, c, smax_b);
    cout << '\n';

    a = -25; b = -37; c = -17;
    smax_a = SignedMaxA(a, b, c);
    smax_b = SignedMaxB(a, b, c);
    PrintResult("SignedMaxA", a, b, c, smax_a);
    PrintResult("SignedMaxB", a, b, c, smax_b);
    cout << '\n';

    return 0;
}

; extern "C" int SignedMinA(int a, int b, int c);
;
; Returns:  min(a, b, c)

        .code
SignedMinA proc
        mov eax,ecx
        cmp eax,edx                     ;compare a and b
        jle @F                          ;jump if a <= b
        mov eax,edx                     ;eax = b

@@:     cmp eax,r8d                     ;compare min(a, b) and c
        jle @F
        mov eax,r8d                     ;eax = min(a, b, c)

@@:     ret
SignedMinA endp

; extern "C" int SignedMaxA(int a, int b, int c);
;
; Returns:  max(a, b, c)

SignedMaxA proc
        mov eax,ecx
        cmp eax,edx                     ;compare a and b
        jge @F                          ;jump if a >= b
        mov eax,edx                     ;eax = b

@@:     cmp eax,r8d                     ;compare max(a, b) and c
        jge @F
        mov eax,r8d                     ;eax = max(a, b, c)

@@:     ret
SignedMaxA endp

; extern "C" int SignedMinB(int a, int b, int c);
;
; Returns:  min(a, b, c)

SignedMinB proc
        cmp ecx,edx
        cmovg ecx,edx                   ;ecx = min(a, b)
        cmp ecx,r8d
        cmovg ecx,r8d                   ;ecx = min(a, b, c)
        mov eax,ecx
        ret
SignedMinB endp

; extern "C" int SignedMaxB(int a, int b, int c);
;
; Returns:  max(a, b, c)

SignedMaxB proc
        cmp ecx,edx
        cmovl ecx,edx                   ;ecx = max(a, b)
        cmp ecx,r8d
        cmovl ecx,r8d                   ;ecx = max(a, b, c)
        mov eax,ecx
        ret
SignedMaxB endp
        end

When developing code to implement a particular algorithm, it is often necessary to determine the
minimum or maximum value of two numbers. The standard C++ library defines two template functions
named std::min() and std::max() to perform these operations. The assembly language code that's
shown in Listing 2-7 contains several three-argument versions of signed-integer minimum and maximum
functions. The purpose of these functions is to illustrate proper use of the jcc and cmovcc instructions. The
first function, called SignedMinA, finds the minimum value of three signed integers. After copying a to EAX,
the first code block determines min(a, b) using two instructions: cmp eax,edx and jle @F. The cmp
instruction, which you saw earlier in this chapter, subtracts the source operand from the destination operand
and sets the status flags based on the result (the result is not saved). The operand of the jle (Jump if Less or
Equal) instruction, @F, is an assembler symbol that designates the nearest forward @@ label as the target of
the conditional jump (the symbol @B can be used for backward jumps). Following calculation of min(a, b),
the next code block determines min(min(a, b), c) using the same technique. With the result already present
in register EAX, SignedMinA can return to the caller.

The function SignedMaxA uses the same approach to find the maximum of three signed integers. The 
only difference between SignedMaxA and SignedMinA is the use of a jge (Jump if Greater or Equal) instead 
of a jle instruction. Versions of SignedMinA and SignedMaxA that operate on unsigned integers can be 
easily created by changing the jle and jge instructions to jbe (Jump if Below or Equal) and jae (Jump if 
Above or Equal), respectively. Recall from the discussion in Chapter 1 that condition codes using the words 
"greater" and "less" are intended for signed integer operands, while "above" and "below" are used with 
unsigned integer operands. 
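
The effect of choosing signed or unsigned condition codes can be illustrated with a short C++ sketch. The functions below are hypothetical (they are not part of example Ch02_07); the only difference between them is the operand type, which determines whether a compiler would use a signed or an unsigned comparison.

```cpp
#include <cassert>
#include <cstdint>

// Signed version: comparisons correspond to the "less/greater" condition codes.
int32_t SignedMin3(int32_t a, int32_t b, int32_t c)
{
    int32_t m = a;
    if (m > b)          // a jle would skip the move when m <= b
        m = b;
    if (m > c)
        m = c;
    return m;
}

// Unsigned version: comparisons correspond to the "below/above" condition codes.
uint32_t UnsignedMin3(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t m = a;
    if (m > b)          // a jbe would skip the move when m <= b
        m = b;
    if (m > c)
        m = c;
    return m;
}

// Usage example: 0xFFFFFFFD and 0xFFFFFFEA are the 32-bit two's complement
// patterns of -3 and -22, yet they are very large when treated as unsigned.
void CheckMin3()
{
    assert(SignedMin3(-3, -22, 28) == -22);
    assert(UnsignedMin3(0xFFFFFFFDu, 0xFFFFFFEAu, 28u) == 28u);
}
```

The same three bit patterns yield different minimums, which is why the jump (or conditional move) mnemonic must match the intended interpretation of the operands.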

The assembly language code also contains the functions SignedMinB and SignedMaxB. These
functions determine the minimum and maximum of three signed integers using conditional move 
instructions instead of conditional jumps. The cmovcc instruction tests the specified condition and if it's true, 
the source operand is copied to the destination operand. If the specified condition is false, the destination 
operand is not altered. 

If you examine the function SignedMinB, you will notice that following the cmp ecx,edx instruction is
a cmovg ecx,edx instruction. The cmovg (Move if Greater) instruction copies the contents of EDX to ECX if
ECX is greater than EDX. In this example, registers ECX and EDX contain argument values a and b. Following
execution of the cmovg instruction, register ECX contains min(a, b). Another cmp and cmovg instruction
sequence follows, which yields min(a, b, c). The same technique is used in SignedMaxB, which employs
cmovl instead of cmovg to save the largest signed integer. Unsigned versions of these functions can be easily
created by using cmova and cmovb instead of cmovg and cmovl, respectively. Here's the output for Ch02_07.


SignedMinA(   2,   15,    8) =    2
SignedMinB(   2,   15,    8) =    2

SignedMinA(  -3,  -22,   28) =  -22
SignedMinB(  -3,  -22,   28) =  -22

SignedMinA(  17,   37,  -11) =  -11
SignedMinB(  17,   37,  -11) =  -11

SignedMaxA(  10,    5,    3) =   10
SignedMaxB(  10,    5,    3) =   10

SignedMaxA(  -3,   28,   15) =   28
SignedMaxB(  -3,   28,   15) =   28

SignedMaxA( -25,  -37,  -17) =  -17
SignedMaxB( -25,  -37,  -17) =  -17


The use of a conditional move instruction to eliminate one or more conditional jump statements 
frequently results in faster code, especially in situations where the processor is unable to accurately predict 
whether the jump will be performed. You'll learn more about some of the issues related to optimal use of the
conditional jump and conditional move instructions in Chapter 15. 
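
In C++, the conditional-move style is usually expressed with the conditional operator. The sketch below uses hypothetical function names and mirrors the structure of SignedMinB and SignedMaxB; the comments indicate the corresponding assembly language instructions. Note that whether a particular compiler actually emits cmovcc instructions for this code is an assumption that depends on the compiler and its optimization settings.

```cpp
#include <cassert>

// Selects the smaller value at each step without an explicit if statement.
int SignedMinBRef(int a, int b, int c)
{
    int m = (a > b) ? b : a;    // cmp ecx,edx / cmovg ecx,edx
    m = (m > c) ? c : m;        // cmp ecx,r8d / cmovg ecx,r8d
    return m;
}

// Selects the larger value at each step.
int SignedMaxBRef(int a, int b, int c)
{
    int m = (a < b) ? b : a;    // cmp ecx,edx / cmovl ecx,edx
    m = (m < c) ? c : m;        // cmp ecx,r8d / cmovl ecx,r8d
    return m;
}

// Usage example with the test values from example Ch02_07.
void CheckMinMaxBRef()
{
    assert(SignedMinBRef(2, 15, 8) == 2);
    assert(SignedMinBRef(-3, -22, 28) == -22);
    assert(SignedMinBRef(17, 37, -11) == -11);
    assert(SignedMaxBRef(10, 5, 3) == 10);
    assert(SignedMaxBRef(-3, 28, 15) == 28);
    assert(SignedMaxBRef(-25, -37, -17) == -17);
}
```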
Summary


Here are the key learning points of Chapter 2: 


The add and sub instructions perform integer (signed and unsigned) addition and 
subtraction. 


The imul and idiv instructions carry out signed integer multiplication and division. 
The corresponding instructions for unsigned integers are mul and div. The idiv and 
div instructions usually require the dividend to be sign- or zero-extended prior to 
use. 


The and, or, and xor instructions are used to perform bitwise AND, inclusive OR, and 
exclusive OR operations. The shl and shr instructions execute logical left and right 
shifts; sar is used for arithmetic right shifts. 


Nearly all arithmetic, logical, and shift instructions set the status flags to indicate the 
results of an operation. The cmp instruction also sets the status flags. The jcc and 
cmovcc instructions can be used to alter program flow or perform conditional data 
moves based on the state of one or more status flags. 


The x86-64 instruction set supports a variety of different address modes for accessing 
operands stored in memory. 


MASM uses the .code, .data, and .const directives to designate code, data, and 
constant data sections. The directives proc and endp denote the beginning and end 
of an assembly language function. 


The Visual C++ calling convention requires a calling function to use registers RCX, 
RDX, R8, and R9 (or the low-order portions of these registers for values smaller than 
64 bits) for the first four integer or pointer arguments. Additional arguments are 
passed on the stack. 


To disable the creation of decorated names by the C++ compiler, assembly language 
functions must be declared using the extern "C" modifier. Global variables shared 
between C++ and assembly language code must also use the extern "C" modifier. 
X86-64 Core Programming - Part 2


The previous chapter introduced the fundamentals of x86-64 assembly language programming. You 
learned how to use the x86-64 instruction set to perform integer addition, subtraction, multiplication, 
and division. You also examined source code that illustrated use of logical instructions, shift operations, 
memory addressing modes, and conditional jumps and moves. In addition to learning about frequently used 
instructions, your initiation to x86-64 assembly language programming has also covered important practical 
details including assembler directives and calling convention requirements. 

In this chapter, your exploration of x86-64 assembly language programming fundamentals continues. 
You'll learn how to use additional x86-64 instructions and assembler directives. You'll also study 
source code that elucidates how to manipulate common programming constructs including arrays and 
data structures. This chapter concludes with several examples that demonstrate use of the x86's string 
instructions. 


Arrays 


Arrays are an indispensable data construct in virtually all programming languages. In C++ there is an 
inherent connection between arrays and pointers since the name of an array is essentially a pointer to its 
first element. Moreover, whenever an array is used as a C++ function parameter, a pointer is passed instead 
of duplicating the array on the stack. Pointers are also employed for arrays that are dynamically allocated at 
runtime. This section examines x86-64 assembly language code that processes arrays. The first two sample 
programs demonstrate how to perform simple operations using one-dimensional arrays. This is followed by 
two examples that explain the techniques necessary to access the elements of a two-dimensional array. 


One-Dimensional Arrays 


In C++ one-dimensional arrays are stored in a contiguous block of memory that can be statically allocated
at compile time or dynamically during program execution. The elements of a C++ array are accessed using
zero-based indexing, which means that valid indices for an array of size N range from 0 to N-1. The sample 
code of this section includes examples that carry out basic operations with one-dimensional arrays using the 
x86-64 instruction set. 


Accessing Elements 


Listing 3-1 shows the source code for example Ch03_01. In this example, the function CalcArraySum
sums the elements of an integer array. Near the top of the C++ code is the now familiar declaration for the
assembly language function CalcArraySum. The summing calculation that's performed by this function is
duplicated in the C++ function CalcArraySumCpp for comparison purposes.
Listing 3-1. Example Ch03_01


#include "stdafx.h"
#include <iostream>
#include <iomanip>

using namespace std;

extern "C" int CalcArraySum(const int* x, int n);

int CalcArraySumCpp(const int* x, int n)
{
    int sum = 0;

    for (int i = 0; i < n; i++)
        sum += *x++;

    return sum;
}

int main()
{
    int x[] {3, 17, -13, 25, -2, 9, -6, 12, 88, -19};
    int n = sizeof(x) / sizeof(int);

    cout << "Elements of array x" << '\n';

    for (int i = 0; i < n; i++)
        cout << "x[" << i << "] = " << x[i] << '\n';
    cout << '\n';

    int sum1 = CalcArraySumCpp(x, n);
    int sum2 = CalcArraySum(x, n);

    cout << "sum1 = " << sum1 << '\n';
    cout << "sum2 = " << sum2 << '\n';
    return 0;
}

;-------------------------------------------------
;               Ch03_01.asm
;-------------------------------------------------

; extern "C" int CalcArraySum(const int* x, int n)
;
; Returns:  Sum of elements in array x

        .code
CalcArraySum proc

; Initialize sum to zero
        xor eax,eax                     ;sum = 0

; Make sure 'n' is greater than zero
        cmp edx,0
        jle InvalidCount                ;jump if n <= 0

; Sum the elements of the array
@@:     add eax,[rcx]                   ;add next element to total (sum += *x)
        add rcx,4                       ;set pointer to next element (x++)
        dec edx                         ;adjust counter (n -= 1)
        jnz @B                          ;repeat if not done

InvalidCount:
        ret

CalcArraySum endp
        end


The function CalcArraySum begins with an xor eax,eax instruction that initializes the running sum to
zero. The cmp edx,0 and jle InvalidCount instructions prevent execution of the summing loop if n <= 0
is true. Sweeping through the array to sum the elements requires only four instructions. The add eax,[rcx]
instruction adds the current array element to the running sum in register EAX. Four is then added to register
RCX so that it points to the next element in the array. The constant four is used here since the size of each
integer in array x is four bytes. A dec edx (Decrement by 1) instruction subtracts 1 from the counter and
updates the state of RFLAGS.ZF. This enables the jnz instruction to terminate the loop after all n elements
have been summed. The instruction sequence employed here to calculate the array element sum is the
assembly language equivalent of the for loop that's used in function CalcArraySumCpp. Here's the output for
Ch03_01:


Elements of array x
x[0] = 3
x[1] = 17
x[2] = -13
x[3] = 25
x[4] = -2
x[5] = 9
x[6] = -6
x[7] = 12
x[8] = 88
x[9] = -19

sum1 = 114
sum2 = 114
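
To make the correspondence between the assembly language loop and C++ more explicit, the following sketch restructures the summing code so that each statement matches one instruction of CalcArraySum. The function names are hypothetical. The do-while form is used because the assembly language loop tests its counter at the bottom, after the first element has already been added.

```cpp
#include <cassert>

// Hypothetical C++ model that follows CalcArraySum instruction by instruction.
int CalcArraySumModel(const int* x, int n)
{
    int sum = 0;                // xor eax,eax

    if (n <= 0)                 // cmp edx,0 / jle InvalidCount
        return sum;

    do
    {
        sum += *x;              // add eax,[rcx]
        x += 1;                 // add rcx,4 (pointer arithmetic scales by sizeof(int))
        n -= 1;                 // dec edx
    } while (n != 0);           // jnz @B

    return sum;
}

// Usage example with the array from Listing 3-1.
void CheckCalcArraySumModel()
{
    const int x[] {3, 17, -13, 25, -2, 9, -6, 12, 88, -19};
    const int n = sizeof(x) / sizeof(int);

    assert(CalcArraySumModel(x, n) == 114);
    assert(CalcArraySumModel(x, 0) == 0);
}
```

The initial test on n is what makes the bottom-tested loop safe: without it, a count of zero would cause the loop body to execute once and then continue until the counter wrapped around.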
Using Elements in Calculations


When working with arrays, it is frequently necessary to define functions that perform element-by-element
transformations. The next source code example, named Ch03_02, illustrates an array transformation
operation using separate source and destination arrays. It also introduces function prologs and epilogs, and
a few new instructions. Listing 3-2 shows the source code for example Ch03_02.


Listing 3-2. Example Ch03_02 


#include "stdafx.h" 
#include <iostream> 
#include <iomanip> 
#include <cassert> 

using namespace std; 

extern "C" long long CalcArrayValues_(long long* y, const int* x, int a, short b, int n); 

long long CalcArrayValuesCpp(long long* y, const int* x, int a, short b, int n) 
{ 
    long long sum = 0; 

    for (int i = 0; i < n; i++) 
    { 
        y[i] = (long long)x[i] * a + b; 
        sum += y[i]; 
    } 

    return sum; 
} 

int main() 
{ 
    const int a = -6; 
    const short b = -13; 
    const int x[] {26, 12, -53, 19, 14, 21, 31, -4, 12, -9, 41, 7}; 
    const int n = sizeof(x) / sizeof(int); 

    long long y1[n]; 
    long long y2[n]; 

    long long sum_y1 = CalcArrayValuesCpp(y1, x, a, b, n); 
    long long sum_y2 = CalcArrayValues_(y2, x, a, b, n); 

    cout << "a = " << a << '\n'; 
    cout << "b = " << b << '\n'; 
    cout << "n = " << n << "\n\n"; 

    for (int i = 0; i < n; i++) 
    { 
        cout << "i: " << setw(2) << i << "  "; 
        cout << "x: " << setw(6) << x[i] << "  "; 
        cout << "y1: " << setw(6) << y1[i] << "  "; 
        cout << "y2: " << setw(6) << y2[i] << '\n'; 
    } 

    cout << '\n'; 
    cout << "sum_y1 = " << sum_y1 << '\n'; 
    cout << "sum_y2 = " << sum_y2 << '\n'; 
    return 0; 
} 
;------------------------------------------------- 
;               Ch03_02.asm 
;------------------------------------------------- 

; extern "C" long long CalcArrayValues_(long long* y, const int* x, int a, short b, int n); 
; 
; Calculation:  y[i] = x[i] * a + b 
; 
; Returns:      Sum of the elements in array y. 

        .code 
CalcArrayValues_ proc frame 

; Function prolog 
        push rsi                            ;save non-volatile register rsi 
        .pushreg rsi 
        push rdi                            ;save non-volatile register rdi 
        .pushreg rdi 
        .endprolog 

; Initialize sum to zero and make sure 'n' is valid 
        xor rax,rax                         ;sum = 0 
        mov r11d,[rsp+56]                   ;r11d = n 
        cmp r11d,0 
        jle InvalidCount                    ;jump if n <= 0 

; Initialize source and destination pointers 
        mov rsi,rdx                         ;rsi = ptr to array x 
        mov rdi,rcx                         ;rdi = ptr to array y 

; Load expression constants and array index 
        movsxd r8,r8d                       ;r8 = a (sign extended) 
        movsx r9,r9w                        ;r9 = b (sign extended) 
        xor edx,edx                         ;edx = array index i 

; Repeat until done 
@@:     movsxd rcx,dword ptr [rsi+rdx*4]    ;rcx = x[i] (sign extended) 
        imul rcx,r8                         ;rcx = x[i] * a 
        add rcx,r9                          ;rcx = x[i] * a + b 
        mov qword ptr [rdi+rdx*8],rcx       ;y[i] = rcx 
        add rax,rcx                         ;update running sum 
        inc edx                             ;edx = i + 1 
        cmp edx,r11d                        ;is i >= n? 
        jl @B                               ;jump if i < n 

InvalidCount: 

; Function epilog 
        pop rdi                             ;restore caller's rdi 
        pop rsi                             ;restore caller's rsi 
        ret 

CalcArrayValues_ endp 
        end 


The x86-64 assembly language function CalcArrayValues_ computes y[i] = x[i] * a + b. If you 
examine the declaration for this function in the C++ code, you will notice that the source array x is declared 
as an int while the destination array y is declared as long long. The other function arguments a, b, and 
n are declared as int, short, and int respectively. The remainder of the C++ code includes the function 
CalcArrayValuesCpp that also computes the specified array transformation for comparison purposes. It also 
includes code to display the results. 

You may have noticed that in all of the sample source code presented thus far, only a subset of the 
general-purpose registers have been used. The reason for this is that the Visual C++ calling convention 
designates each general-purpose register as either volatile or non-volatile. Functions are permitted to use 
and alter the contents of any volatile register but cannot use a non-volatile register unless it preserves the 
caller's original value. The Visual C++ calling convention designates registers RAX, RCX, RDX, R8, R9, R10, 
and R11 as volatile and the remaining general-purpose registers as non-volatile. 

The function CalcArrayValues uses non-volatile registers RSI and RDI, which means that their values 
must be preserved. A function typically saves the values of any non-volatile registers it uses on the stack in 
a section of code called the prolog. A function epilog contains code that restores the values of any saved 
non-volatile registers. Function prologs and epilogs are also used to perform other calling-convention 
initialization tasks and you'll learn about these in Chapter 5. 

In the assembly language code for Ch03_02, the statement CalcArrayValues_ proc frame denotes the 
start of function CalcArrayValues_. Note the frame attribute on the proc directive. This attribute indicates 
that CalcArrayValues_ uses a formal function prolog. It also enables additional directives that must be used 
whenever a general-purpose register is saved on the stack or whenever a function employs a stack frame 
pointer. Chapter 5 discusses the frame attribute and stack frame pointers in greater detail. 

The first x86-64 assembly language instruction of CalcArrayValues_ is push rsi (Push Value onto 
Stack), which saves the current value in register RSI on the stack. Immediately following this is a .pushreg 
rsi directive. This directive instructs the assembler to save information about the push rsi instruction in an 
assembler-maintained table that is used to unwind the stack during exception processing. Using exceptions 
with assembly language code is not discussed in this book but the calling convention requirements for 
saving registers on the stack must still be observed. Register RDI is then saved on the stack using a push rdi 
instruction. The required .pushreg rdi directive follows next and the subsequent .endprolog directive 
signifies the end of the prolog for CalcArrayValues_. 

Figure 3-1 illustrates the contents of the stack after execution of the push rsi and push rdi 
instructions. Following the function prolog, argument n is tested for validity. A mov r11d,[rsp+56] instruction 
loads the value of n into register R11D. It is important to note that the displacement used in this instruction to 
load n from the stack is different than in previous examples due to the push instructions that were used in 
the prolog. If the value of n is valid, registers RSI and RDI are initialized as pointers to the arrays x and y. The 
movsxd r8,r8d and movsx r9,r9w instructions load sign-extended copies of argument values a and b into 
registers R8 and R9 while the xor edx,edx instruction initializes array index i to zero. 
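
The displacement of 56 can be checked with simple arithmetic. The following C++ sketch is not part of the sample code; it assumes the usual Visual C++ x64 stack layout on function entry (the return address at [rsp], followed by four 8-byte home slots for RCX, RDX, R8, and R9, and then the fifth argument). The constant and function names are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>

// Illustrative constants (hypothetical names) describing the stack on entry
// to a Visual C++ x64 function, before any prolog pushes execute.
constexpr std::size_t kSlotSize = 8;                  // each stack slot is 8 bytes
constexpr std::size_t kReturnAddrSize = kSlotSize;    // return address at [rsp]
constexpr std::size_t kHomeAreaSize = 4 * kSlotSize;  // RCX, RDX, R8, R9 home slots

// Displacement from RSP to the fifth argument after num_pushes prolog pushes.
constexpr std::size_t FifthArgDisp(std::size_t num_pushes)
{
    return num_pushes * kSlotSize + kReturnAddrSize + kHomeAreaSize;
}

// Without any pushes the fifth argument is at [rsp+40]; the two pushes in
// CalcArrayValues_ move it to [rsp+56].
static_assert(FifthArgDisp(0) == 40, "no pushes: [rsp+40]");
static_assert(FifthArgDisp(2) == 56, "push rsi, push rdi: [rsp+56]");
```

Each additional push in a prolog adds eight bytes to the displacement, so a function that saves three registers would find its fifth argument at [rsp+64].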


[Figure: stack diagram from high memory to low memory; labeled entries include R9 Home, RDX Home, Return Address, and Saved RDI] 

Figure 3-1. Stack and register contents after prolog in CalcArrayValues_ 

The processing loop of CalcArrayValues_ uses a movsxd rcx,dword ptr [rsi+rdx*4] instruction to 
load a sign-extended copy of x[i] into register RCX. The ensuing imul rcx,r8 and add rcx,r9 instructions 
calculate x[i] * a + b and the mov qword ptr [rdi+rdx*8],rcx instruction saves the final result to y[i]. Note 
that in the processing loop, the two move instructions use different scale factors. This is because array x and 
array y are declared as int and long long. The add rax,rcx instruction updates a running sum that will be 
used as the return value. The inc edx (Increment by 1) instruction adds 1 to the value that's in register EDX. 
It also zeros bits 63:32 of register RDX. The reason for using an inc edx instruction instead of an inc rdx 
instruction is that the machine language encoding of the former requires less code space. More importantly, 
it is okay to use an inc edx instruction here since the maximum number of elements to be processed is 
specified by a 32-bit signed integer (n) that's already been validated as being greater than zero. The following 
cmp edx,r11d instruction compares the contents of EDX (which is i) to n, and the jl instruction repeats the 
processing loop while i is less than n. 
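
The two details above, the different scale factors and the zeroing of bits 63:32 by a 32-bit register write, can be modeled in a few lines of C++. This is only an illustrative sketch; the helper names ElemOffset and Inc32 are hypothetical and do not appear in the sample code.

```cpp
#include <cstddef>
#include <cstdint>

// Byte offset of element i for a given element size; this is the
// IndexReg*ScaleFactor part of BaseReg+IndexReg*ScaleFactor addressing.
constexpr std::size_t ElemOffset(std::size_t i, std::size_t scale)
{
    return i * scale;
}

// Models 'inc edx' as seen through the full 64-bit register RDX: the 32-bit
// result is zero-extended, so bits 63:32 of the 64-bit register become zero.
constexpr std::uint64_t Inc32(std::uint64_t reg64)
{
    return static_cast<std::uint32_t>(static_cast<std::uint32_t>(reg64) + 1u);
}

// x holds 4-byte elements and y holds 8-byte elements, so the same index
// maps to different byte offsets.
static_assert(ElemOffset(3, sizeof(std::int32_t)) == 12, "x[3] at byte offset 12");
static_assert(ElemOffset(3, sizeof(std::int64_t)) == 24, "y[3] at byte offset 24");

// Any stale upper bits in RDX are cleared by the 32-bit increment.
static_assert(Inc32(0xFFFFFFFF00000004ULL) == 5, "upper 32 bits are cleared");
```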

After the main processing loop is the epilog for function CalcArrayValues_. Recall that in the prolog, 
the caller's RSI and RDI registers were saved on the stack using two push instructions. In the epilog, the 
instructions pop rdi and pop rsi (Pop Value from Stack) are used to restore the caller's RDI and RSI 
registers. The order in which a caller's non-volatile registers are popped from the stack in an epilog must 
be the reverse of how they were saved in the prolog. Following non-volatile register restoration is a ret 
instruction that transfers program control back to the calling function. Given the stack operations that 
occur in a function's prolog and epilog, it should be readily apparent that failure to properly save or restore 
a non-volatile register is likely to cause a program crash (if the return address is incorrect) or a subtle 
software bug that may be difficult to pinpoint. Here are the results for example Ch03_02. 

a = -6 
b = -13 
n = 12 

i:  0  x:     26  y1:   -169  y2:   -169 
i:  1  x:     12  y1:    -85  y2:    -85 
i:  2  x:    -53  y1:    305  y2:    305 
i:  3  x:     19  y1:   -127  y2:   -127 
i:  4  x:     14  y1:    -97  y2:    -97 
i:  5  x:     21  y1:   -139  y2:   -139 
i:  6  x:     31  y1:   -199  y2:   -199 
i:  7  x:     -4  y1:     11  y2:     11 
i:  8  x:     12  y1:    -85  y2:    -85 
i:  9  x:     -9  y1:     41  y2:     41 
i: 10  x:     41  y1:   -259  y2:   -259 
i: 11  x:      7  y1:    -55  y2:    -55 

sum_y1 = -858 
sum_y2 = -858 


Two-Dimensional Arrays 


C++ also utilizes a contiguous block of memory to implement a two-dimensional array or matrix. The 
elements of a C++ matrix in memory are organized using row-major ordering. Row-major ordering arranges 
the elements of a matrix first by row and then by column. For example, elements of the matrix int x[3][2] 
are stored in memory as follows: x[0][0], x[0][1], x[1][0], x[1][1], x[2][0], and x[2][1]. In order to 
access a specific element in the matrix, a function (or a compiler) must know the starting address of the 
matrix (i.e., the address of its first element), the row and column indices, the total number of columns, and 
the size in bytes of each element. Using this information, a function can use simple arithmetic to access a 
specific element in a matrix as exemplified by the sample code in this section. 
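
The arithmetic involved can be sketched in C++ as follows. The function names RowMajorIndex and RowMajorByteOffset are hypothetical and used only for illustration; the final check assumes a 4-byte int.

```cpp
#include <cstddef>

// Linear element index of x[row][col] in a row-major matrix with ncols columns.
constexpr std::size_t RowMajorIndex(std::size_t row, std::size_t col, std::size_t ncols)
{
    return row * ncols + col;
}

// Byte offset of x[row][col] from the starting address of the matrix.
constexpr std::size_t RowMajorByteOffset(std::size_t row, std::size_t col,
                                         std::size_t ncols, std::size_t elem_size)
{
    return RowMajorIndex(row, col, ncols) * elem_size;
}

// For int x[3][2], element x[2][1] is the sixth element (index 5).
static_assert(RowMajorIndex(0, 0, 2) == 0, "x[0][0] is the first element");
static_assert(RowMajorIndex(1, 0, 2) == 2, "x[1][0] follows x[0][1]");
static_assert(RowMajorIndex(2, 1, 2) == 5, "x[2][1] is the last element");

// With 4-byte elements, x[2][1] starts 20 bytes past the matrix base address.
static_assert(RowMajorByteOffset(2, 1, 2, 4) == 20, "byte offset of x[2][1]");
```

Note that the number of rows does not appear in the calculation; only the number of columns and the element size are needed to locate an element.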


Accessing Elements 


Listing 3-3 shows the source code for example Ch03_03, which demonstrates how to use x86-64 assembly 
language to access the elements of a matrix. In this example, the functions CalcMatrixSquaresCpp and 
CalcMatrixSquares_ perform the following matrix calculation: y[i][j] = x[j][i] * x[j][i]. Note that in 
this expression, the indices i and j for matrix x are intentionally reversed in order to make the code for this 
example a little more interesting. 


Listing 3-3. Example Ch03_03 


#include "stdafx.h" 
#include <iostream> 
#include <iomanip> 

using namespace std; 

extern "C" void CalcMatrixSquares_(int* y, const int* x, int nrows, int ncols); 

void CalcMatrixSquaresCpp(int* y, const int* x, int nrows, int ncols) 
{ 
    for (int i = 0; i < nrows; i++) 
    { 
        for (int j = 0; j < ncols; j++) 
        { 
            int kx = j * ncols + i; 
            int ky = i * ncols + j; 
            y[ky] = x[kx] * x[kx]; 
        } 
    } 
} 

int main() 
{ 
    const int nrows = 6; 
    const int ncols = 3; 
    int y2[nrows][ncols]; 
    int y1[nrows][ncols]; 
    int x[nrows][ncols] { { 1, 2, 3 }, { 4, 5, 6 }, { 7, 8, 9 }, 
                          { 10, 11, 12 }, { 13, 14, 15 }, { 16, 17, 18 } }; 

    CalcMatrixSquaresCpp(&y1[0][0], &x[0][0], nrows, ncols); 
    CalcMatrixSquares_(&y2[0][0], &x[0][0], nrows, ncols); 

    for (int i = 0; i < nrows; i++) 
    { 
        for (int j = 0; j < ncols; j++) 
        { 
            cout << "y1[" << setw(2) << i << "][" << setw(2) << j << "] = "; 
            cout << setw(6) << y1[i][j] << ' '; 

            cout << "y2[" << setw(2) << i << "][" << setw(2) << j << "] = "; 
            cout << setw(6) << y2[i][j] << ' '; 

            cout << "x[" << setw(2) << j << "][" << setw(2) << i << "] = "; 
            cout << setw(6) << x[j][i] << '\n'; 

            if (y1[i][j] != y2[i][j]) 
                cout << "Compare failed\n"; 
        } 
    } 

    return 0; 
} 

; void CalcMatrixSquares_(int* y, const int* x, int nrows, int ncols); 
; 
; Calculates:   y[i][j] = x[j][i] * x[j][i] 

        .code 
CalcMatrixSquares_ proc frame 

; Function prolog 
        push rsi                            ;save caller's rsi 
        .pushreg rsi 
        push rdi                            ;save caller's rdi 
        .pushreg rdi 
        .endprolog 

; Make sure nrows and ncols are valid 
        cmp r8d,0 
        jle InvalidCount                    ;jump if nrows <= 0 
        cmp r9d,0 
        jle InvalidCount                    ;jump if ncols <= 0 

; Initialize pointers to source and destination arrays 
        mov rsi,rdx                         ;rsi = x 
        mov rdi,rcx                         ;rdi = y 
        xor rcx,rcx                         ;rcx = i 
        movsxd r8,r8d                       ;r8 = nrows sign extended 
        movsxd r9,r9d                       ;r9 = ncols sign extended 

; Perform the required calculations 
Loop1:  xor rdx,rdx                         ;rdx = j 

Loop2:  mov rax,rdx                         ;rax = j 
        imul rax,r9                         ;rax = j * ncols 
        add rax,rcx                         ;rax = j * ncols + i 
        mov r10d,dword ptr [rsi+rax*4]      ;r10d = x[j][i] 
        imul r10d,r10d                      ;r10d = x[j][i] * x[j][i] 

        mov rax,rcx                         ;rax = i 
        imul rax,r9                         ;rax = i * ncols 
        add rax,rdx                         ;rax = i * ncols + j 
        mov dword ptr [rdi+rax*4],r10d      ;y[i][j] = r10d 

        inc rdx                             ;j += 1 
        cmp rdx,r9 
        jl Loop2                            ;jump if j < ncols 

        inc rcx                             ;i += 1 
        cmp rcx,r8 
        jl Loop1                            ;jump if i < nrows 

InvalidCount: 

; Function epilog 
        pop rdi                             ;restore caller's rdi 
        pop rsi                             ;restore caller's rsi 
        ret 

CalcMatrixSquares_ endp 
        end 


The C++ function CalcMatrixSquaresCpp illustrates how to access the elements of a matrix. The first 
thing to note is that arguments x and y point to the memory blocks that contain their respective matrices. 
Inside the second for loop, the expression kx = j * ncols + i calculates the offset necessary to access 
element x[j][i]. Similarly, the expression ky = i * ncols + j calculates the offset for element y[i][j]. 

The assembly language function CalcMatrixSquares_ implements the same calculations as the C++ 
code to access elements in matrices x and y. This function begins with a prolog that saves non-volatile 
registers RSI and RDI using the same instructions and directives as the previous source code example. Next, 
argument values nrows and ncols are checked to ensure that they're greater than zero. Prior to the start of 
the nested processing loops, registers RSI and RDI are initialized as pointers to x and y. Registers RCX and 
RDX are also primed as the loop index variables and perform the same functions as variables i and j in the 
C++ code. This is followed by two movsxd instructions that load sign-extended copies of nrows and ncols 
into registers R8 and R9. 

The section of code that accesses element x[j][i] begins with a mov rax,rdx instruction that copies 
j into register RAX. This is followed by the instructions imul rax,r9 and add rax,rcx, which compute the 
value j * ncols + i. The ensuing mov r10d,dword ptr [rsi+rax*4] instruction loads register R10D with 
x[j][i] and the imul r10d,r10d instruction squares this value. A similar sequence of instructions is used 
to calculate the offset i * ncols + j that's needed for y[i][j]. The mov dword ptr [rdi+rax*4],r10d 
instruction completes execution of the expression y[i][j] = x[j][i] * x[j][i]. Like the corresponding 
C++ code, the nested processing loops in CalcMatrixSquares_ continue executing until the index counters 
j and i (registers RDX and RCX) reach their respective termination values. The final two pop instructions 
restore registers RDI and RSI from the stack prior to execution of the ret instruction. The output for example 
Ch03_03 is shown here. 


y1[ 0][ 0] =      1 y2[ 0][ 0] =      1 x[ 0][ 0] =      1 
y1[ 0][ 1] =     16 y2[ 0][ 1] =     16 x[ 1][ 0] =      4 
y1[ 0][ 2] =     49 y2[ 0][ 2] =     49 x[ 2][ 0] =      7 
y1[ 1][ 0] =      4 y2[ 1][ 0] =      4 x[ 0][ 1] =      2 
y1[ 1][ 1] =     25 y2[ 1][ 1] =     25 x[ 1][ 1] =      5 
y1[ 1][ 2] =     64 y2[ 1][ 2] =     64 x[ 2][ 1] =      8 
y1[ 2][ 0] =      9 y2[ 2][ 0] =      9 x[ 0][ 2] =      3 
y1[ 2][ 1] =     36 y2[ 2][ 1] =     36 x[ 1][ 2] =      6 
y1[ 2][ 2] =     81 y2[ 2][ 2] =     81 x[ 2][ 2] =      9 
y1[ 3][ 0] =     16 y2[ 3][ 0] =     16 x[ 0][ 3] =      4 
y1[ 3][ 1] =     49 y2[ 3][ 1] =     49 x[ 1][ 3] =      7 
y1[ 3][ 2] =    100 y2[ 3][ 2] =    100 x[ 2][ 3] =     10 
y1[ 4][ 0] =     25 y2[ 4][ 0] =     25 x[ 0][ 4] =      5 
y1[ 4][ 1] =     64 y2[ 4][ 1] =     64 x[ 1][ 4] =      8 
y1[ 4][ 2] =    121 y2[ 4][ 2] =    121 x[ 2][ 4] =     11 
y1[ 5][ 0] =     36 y2[ 5][ 0] =     36 x[ 0][ 5] =      6 
y1[ 5][ 1] =     81 y2[ 5][ 1] =     81 x[ 1][ 5] =      9 
y1[ 5][ 2] =    144 y2[ 5][ 2] =    144 x[ 2][ 5] =     12 




Row-Column Calculations 


Listing 3-4 shows the source code for example Ch03_04, which demonstrates how to sum the rows and 
columns of a matrix. The C++ code in Listing 3-4 includes a couple of ancillary functions named Init and 
PrintResult that perform matrix initialization and display results. The function CalcMatrixRowColSumsCpp 
illustrates the summing algorithm. This function sweeps through matrix x using a set of nested for loops. 
During each iteration, it adds the matrix element x[i][j] to the appropriate entries in the arrays row_sums 
and col_sums. Function CalcMatrixRowColSumsCpp also uses the same arithmetic that you saw in the 
previous example to determine the offset of each matrix element. 


Listing 3-4. Example Ch03_04 


#include "stdafx.h" 
#include <iostream> 
#include <iomanip> 
#include <random> 

using namespace std; 

extern "C" int CalcMatrixRowColSums_(int* row_sums, int* col_sums, const int* x, int nrows, 
int ncols); 

void Init(int* x, int nrows, int ncols) 
{ 
    unsigned int seed = 13; 
    uniform_int_distribution<> d {1, 200}; 
    default_random_engine rng {seed}; 

    for (int i = 0; i < nrows * ncols; i++) 
        x[i] = d(rng); 
} 

void PrintResult(const char* msg, const int* row_sums, const int* col_sums, const int* x, 
int nrows, int ncols) 
{ 
    const int w = 6; 
    const char nl = '\n'; 

    cout << msg; 
    cout << "-----------------------------------------\n"; 

    for (int i = 0; i < nrows; i++) 
    { 
        for (int j = 0; j < ncols; j++) 
            cout << setw(w) << x[i * ncols + j]; 
        cout << "  " << setw(w) << row_sums[i] << nl; 
    } 

    cout << nl; 

    for (int i = 0; i < ncols; i++) 
        cout << setw(w) << col_sums[i]; 
    cout << nl; 
} 

int CalcMatrixRowColSumsCpp(int* row_sums, int* col_sums, const int* x, int nrows, int ncols) 
{ 
    int rc = 0; 

    if (nrows > 0 && ncols > 0) 
    { 
        for (int j = 0; j < ncols; j++) 
            col_sums[j] = 0; 

        for (int i = 0; i < nrows; i++) 
        { 
            row_sums[i] = 0; 
            int k = i * ncols; 

            for (int j = 0; j < ncols; j++) 
            { 
                int temp = x[k + j]; 
                row_sums[i] += temp; 
                col_sums[j] += temp; 
            } 
        } 

        rc = 1; 
    } 

    return rc; 
} 

int main() 
{ 
    const int nrows = 7; 
    const int ncols = 5; 
    int x[nrows][ncols]; 

    Init((int*)x, nrows, ncols); 

    int row_sums1[nrows], col_sums1[ncols]; 
    int row_sums2[nrows], col_sums2[ncols]; 

    const char* msg1 = "\nResults using CalcMatrixRowColSumsCpp\n"; 
    const char* msg2 = "\nResults using CalcMatrixRowColSums_\n"; 

    int rc1 = CalcMatrixRowColSumsCpp(row_sums1, col_sums1, (int*)x, nrows, ncols); 
    int rc2 = CalcMatrixRowColSums_(row_sums2, col_sums2, (int*)x, nrows, ncols); 

    if (rc1 == 0) 
        cout << "CalcMatrixRowSumsCpp failed\n"; 
    else 
        PrintResult(msg1, row_sums1, col_sums1, (int*)x, nrows, ncols); 

    if (rc2 == 0) 
        cout << "CalcMatrixRowSums_ failed\n"; 
    else 
        PrintResult(msg2, row_sums2, col_sums2, (int*)x, nrows, ncols); 

    return 0; 
} 

;------------------------------------------------- 
;               Ch03_04.asm 
;------------------------------------------------- 

; extern "C" int CalcMatrixRowColSums_(int* row_sums, int* col_sums, const int* x, int 
; nrows, int ncols) 
; 
; Returns:      0 = nrows <= 0 or ncols <= 0, 1 = success 

        .code 
CalcMatrixRowColSums_ proc frame 

; Function prolog 
        push rbx                            ;save caller's rbx 
        .pushreg rbx 
        push rsi                            ;save caller's rsi 
        .pushreg rsi 
        push rdi                            ;save caller's rdi 
        .pushreg rdi 
        .endprolog 

; Make sure nrows and ncols are valid 
        xor eax,eax                         ;set error return code 

        cmp r9d,0 
        jle InvalidArg                      ;jump if nrows <= 0 

        mov r10d,[rsp+64]                   ;r10d = ncols 
        cmp r10d,0 
        jle InvalidArg                      ;jump if ncols <= 0 

; Initialize elements of col_sums array to zero 
        mov rbx,rcx                         ;temp save of row_sums 
        mov rdi,rdx                         ;rdi = col_sums 
        mov ecx,r10d                        ;rcx = ncols 
        xor eax,eax                         ;eax = fill value 
        rep stosd                           ;fill array with zeros 

; The code below uses the following registers: 
;   rcx = row_sums          rdx = col_sums 
;   r9d = nrows             r10d = ncols 
;   eax = i                 ebx = j 
;   edi = i * ncols         esi = i * ncols + j 
;   r8 = x                  r11d = x[i][j] 

; Initialize outer loop variables. 
        mov rcx,rbx                         ;rcx = row_sums 
        xor eax,eax                         ;i = 0 

Lp1:    mov dword ptr [rcx+rax*4],0         ;row_sums[i] = 0 

        xor ebx,ebx                         ;j = 0 
        mov edi,eax                         ;edi = i 
        imul edi,r10d                       ;edi = i * ncols 

; Inner loop 
Lp2:    mov esi,edi                         ;esi = i * ncols 
        add esi,ebx                         ;esi = i * ncols + j 
        mov r11d,[r8+rsi*4]                 ;r11d = x[i * ncols + j] 
        add [rcx+rax*4],r11d                ;row_sums[i] += x[i * ncols + j] 
        add [rdx+rbx*4],r11d                ;col_sums[j] += x[i * ncols + j] 

; Is the inner loop finished? 
        inc ebx                             ;j += 1 
        cmp ebx,r10d 
        jl Lp2                              ;jump if j < ncols 

; Is the outer loop finished? 
        inc eax                             ;i += 1 
        cmp eax,r9d 
        jl Lp1                              ;jump if i < nrows 

        mov eax,1                           ;set success return code 

; Function epilog 
InvalidArg: 
        pop rdi                             ;restore NV registers and return 
        pop rsi 
        pop rbx 
        ret 

CalcMatrixRowColSums_ endp 
        end 


The assembly language function CalcMatrixRowColSums_ implements the same algorithm as the C++ 
code. Following the function prolog, the arguments nrows and ncols are tested for validity. Note that the 
argument ncols was passed on the stack, as illustrated in Figure 3-2. The elements of col_sums are then 
initialized to zero using a rep stosd (Repeat Store String Doubleword) instruction. This instruction stores 
the contents of EAX, which was initialized to zero, to the memory location specified by RDI; it then adds 
four to RDI so that it points to the next array element. The rep mnemonic is an instruction prefix that tells 
the processor to repeat execution of the stosd instruction. Specifically, this prefix instructs the CPU to 
decrement RCX by 1 following each store action and repeat execution of the stosd instruction until RCX 
equals zero. You'll take a closer look at the x86-64 string processing instructions later in this chapter. 
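
The behavior of rep stosd can be modeled with a short C++ loop. This is a rough sketch rather than an exact description of the hardware: the function name RepStosd and its parameters are hypothetical stand-ins for the registers involved, and the model assumes that the direction flag is clear so that RDI moves toward higher addresses.

```cpp
#include <cstdint>

// Models 'rep stosd' with the direction flag clear (an assumption of this sketch).
// rdi = destination pointer, eax = fill value, rcx = repeat count.
// Returns the final value of the count, which is always zero.
constexpr std::uint64_t RepStosd(std::uint32_t* rdi, std::uint32_t eax, std::uint64_t rcx)
{
    while (rcx != 0)
    {
        *rdi = eax;     // stosd: store EAX at the address in RDI
        ++rdi;          // RDI advances by four bytes (one doubleword)
        --rcx;          // rep: decrement RCX after each store
    }
    return rcx;
}

// Fills a five-element array (the size of col_sums in this example) with zeros.
constexpr bool FillsFiveElements()
{
    std::uint32_t col_sums[5] = {7, 7, 7, 7, 7};
    const std::uint64_t rcx_after = RepStosd(col_sums, 0, 5);

    bool all_zero = true;
    for (std::uint32_t v : col_sums)
        all_zero = all_zero && (v == 0);

    return all_zero && rcx_after == 0;
}

static_assert(FillsFiveElements(), "all elements are zero and the count reaches zero");
```

If the count is zero on entry, the loop body never executes, which matches the behavior of the rep prefix.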


[Figure: stack diagram from high memory to low memory, with RSP pointing to the lowest entry, alongside the register contents] 

Figure 3-2. Stack and register contents after prolog in CalcMatrixRowColSums_ 


In function CalcMatrixRowColSums_, R8 holds the base address of the matrix x. Registers EAX and EBX 
contain the row and column indices i and j, respectively. Each outer loop iteration starts by initializing 
row_sums[i] (RCX points to row_sums) to zero and calculating the intermediate value i * ncols (R10D contains 
ncols). Within the inner loop, the final offset of matrix element x[i][j] is calculated. A mov r11d,[r8+rsi*4] 
instruction loads x[i][j] into R11D. The instructions add [rcx+rax*4],r11d and add [rdx+rbx*4],r11d 
update the totals for row_sums[i] and col_sums[j]. Note that these two instructions use destination 
operands in memory instead of registers. Figure 3-3 illustrates the memory addressing that's used to 
reference elements in x, row_sums, and col_sums. 

[Figure: three memory-addressing diagrams for the case i = 2, j = 3. 
x[7][5]: esi = i * ncols + j, accessed using mov r11d,[r8+rsi*4]. 
row_sums[7]: i = eax = 2, accessed using add [rcx+rax*4],r11d. 
col_sums[5]: j = ebx = 3, accessed using add [rdx+rbx*4],r11d.] 

Figure 3-3. Memory addressing used in function CalcMatrixRowColSums_ 


The nested processing loops in CalcMatrixRowColSums_ repeat until all of the elements in matrix 
x have been added to the correct elements in both row_sums and col_sums. Note that this function uses 
32-bit registers for its counters and indices. Using 32-bit registers often requires less code space than 
64-bit registers, as discussed earlier in this chapter. The code in CalcMatrixRowColSums_ also exploits 
BaseReg+IndexReg*ScaleFactor memory addressing, which simplifies the loading of elements from matrix 
x and the updating of elements in both row_sums and col_sums. Here are the results for example Ch03_04. 


Results using CalcMatrixRowColSumsCpp 


   621   670   779   781   673 
   114    84   109   179    17     503 
   140   126    28    52    55     401 
   126   100   186   115   145     672 


   621   670   779   781   673 


Structures 


A structure is a programming language construct that facilitates the definition of new data types using one 
or more existing data types. In this section, you'll learn how to define and use a common structure in both a 
C++ and x86-64 assembly language function. You'll also learn how to deal with potential semantic issues that 
can arise when working with a common structure that's manipulated by software functions written using 
different programming languages. 

In C++ a structure is equivalent to a class. When a data type is defined using the keyword struct instead 
of class, all members are public by default. A C++ struct that's declared sans any member functions or 
operators is equivalent to a C-style structure such as typedef struct { ... } MyStruct;. C++ structure 
declarations are usually placed in a header (.h) file so they can be easily referenced by multiple C++ files. 
The same technique also can be employed to declare and reference structures that are used in assembly 
language code. Unfortunately, it is not possible to declare a single structure in a header file and include this 
file in both C++ and assembly-language source code files. If you want to use the "same" structure in both 
C++ and assembly language code, it must be declared twice and both declarations must be semantically 
equivalent. 

Listing 3-5 shows the C++ and x86 assembly language source code for example Ch03_05. In the C++ 
code, a simple structure named TestStruct is declared. This structure uses sized integer types instead of the 
more common C++ types to highlight the exact size of each member. The other noteworthy detail regarding 
TestStruct is the inclusion of the structure member Pad8. While not explicitly required, the presence of 
this member helps document the fact that the C++ compiler defaults to aligning structure members to their 
natural boundaries. The assembly language version of TestStruct looks similar to its C++ counterpart. The 
biggest difference between the two is that the assembler does not automatically align structure members 
to their natural boundaries. Here the definition of Pad8 is required; without the member Pad8, the C++ and 
assembly language versions would be semantically different. The ? symbol that's included with each data 
element declaration notifies the assembler to perform storage allocation only and is customarily used to 
remind the programmer that structure members are always uninitialized. 
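
The effect of natural alignment can be observed directly in C++. The following sketch is not part of the sample code; it declares a hypothetical variant of the structure, named TestStructNoPad here, that omits the explicit padding member. It assumes a compiler that uses natural alignment with default packing, which is the default behavior described above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical variant of TestStruct with the explicit Pad8 member omitted.
struct TestStructNoPad
{
    std::int8_t  Val8;
    std::int16_t Val16;     // the compiler inserts one padding byte before this member
    std::int32_t Val32;
    std::int64_t Val64;
};

// With natural alignment, each member lands on a multiple of its own size,
// so the layout is the same as if the padding byte had been declared.
static_assert(offsetof(TestStructNoPad, Val8)  == 0, "Val8 at offset 0");
static_assert(offsetof(TestStructNoPad, Val16) == 2, "Val16 aligned to 2 bytes");
static_assert(offsetof(TestStructNoPad, Val32) == 4, "Val32 aligned to 4 bytes");
static_assert(offsetof(TestStructNoPad, Val64) == 8, "Val64 aligned to 8 bytes");
static_assert(sizeof(TestStructNoPad) == 16, "total size is 16 bytes");
```

An assembly language declaration that omitted the padding byte would instead place Val16 at offset 1, which is why the two declarations would no longer agree.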


Listing 3-5. Example Ch03_05 


#include "stdafx.h" 
#include <iostream> 
#include <iomanip> 
#include <cstdint> 

using namespace std; 

struct TestStruct 
{ 
    int8_t Val8; 
    int8_t Pad8; 
    int16_t Val16; 
    int32_t Val32; 
    int64_t Val64; 
}; 

extern "C" int64_t CalcTestStructSum_(const TestStruct* ts); 

int64_t CalcTestStructSumCpp(const TestStruct* ts) 
{ 
    return ts->Val8 + ts->Val16 + ts->Val32 + ts->Val64; 
} 

int main() 
{ 
    TestStruct ts; 

    ts.Val8 = -100; 
    ts.Val16 = 2000; 
    ts.Val32 = -300000; 
    ts.Val64 = 40000000000; 

    int64_t sum1 = CalcTestStructSumCpp(&ts); 
    int64_t sum2 = CalcTestStructSum_(&ts); 

    cout << "ts1.Val8 = " << (int)ts.Val8 << '\n'; 
    cout << "ts1.Val16 = " << ts.Val16 << '\n'; 
    cout << "ts1.Val32 = " << ts.Val32 << '\n'; 
    cout << "ts1.Val16 = " << ts.Val64 << '\n'; 
    cout << '\n'; 
    cout << "sum1 = " << sum1 << '\n'; 
    cout << "sum2 = " << sum2 << '\n'; 
    return 0; 
} 

;------------------------------------------------- 
;               Ch03_05.asm 
;------------------------------------------------- 

TestStruct struct 
Val8    byte ? 
Pad8    byte ? 
Val16   word ? 
Val32   dword ? 
Val64   qword ? 
TestStruct ends 

; extern "C" int64_t CalcTestStructSum_(const TestStruct* ts); 
; 
; Returns:      Sum of structure's values as a 64-bit integer. 

        .code 
CalcTestStructSum_ proc 

; Compute ts->Val8 + ts->Val16, note sign extension to 32-bits 
        movsx eax,byte ptr [rcx+TestStruct.Val8] 
        movsx edx,word ptr [rcx+TestStruct.Val16] 
        add eax,edx 

; Sign extend previous result to 64 bits 
        movsxd rax,eax 

; Add ts->Val32 to sum 
        movsxd rdx,[rcx+TestStruct.Val32] 
        add rax,rdx 

; Add ts->Val64 to sum 
        add rax,[rcx+TestStruct.Val64] 
        ret 

CalcTestStructSum_ endp 
        end 


The C++ function CalcTestStructSumCpp sums the members of the TestStruct instance that's passed 
to it. The x86 assembly language function CalcTestStructSum_ performs the same operation. The movsx 
eax,byte ptr [rcx+TestStruct.Val8] and movsx edx,word ptr [rcx+TestStruct.Val16] instructions 
load sign-extended copies of structure members TestStruct.Val8 and TestStruct.Val16 into registers EAX 
and EDX, respectively. These instructions also illustrate the syntax that is required to reference a structure 
member in an assembly language instruction. From the perspective of the assembler, the movsx instructions 
are instances of BaseReg+Disp memory addressing since the assembler ultimately converts structure 
members TestStruct.Val8 and TestStruct.Val16 into constant displacement values. 
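
The constant displacements, and the widening steps that the assembly language function performs, can be mirrored in C++. The following is a sketch under the assumption of natural alignment; the function name SumWithExplicitWidening and its local variable names are hypothetical and merely echo the registers used in the listing.

```cpp
#include <cstddef>
#include <cstdint>

struct TestStruct
{
    std::int8_t  Val8;
    std::int8_t  Pad8;
    std::int16_t Val16;
    std::int32_t Val32;
    std::int64_t Val64;
};

// Displacements that correspond to TestStruct.<member> in the assembly code.
static_assert(offsetof(TestStruct, Val8)  == 0, "Val8 displacement");
static_assert(offsetof(TestStruct, Val16) == 2, "Val16 displacement");
static_assert(offsetof(TestStruct, Val32) == 4, "Val32 displacement");
static_assert(offsetof(TestStruct, Val64) == 8, "Val64 displacement");

// Mirrors the instruction sequence of CalcTestStructSum_ one step at a time.
// Unlike the add instruction, signed overflow in C++ is undefined, so this
// model is only meaningful for sums that fit in 64 bits.
constexpr std::int64_t SumWithExplicitWidening(const TestStruct& ts)
{
    std::int32_t eax = ts.Val8;     // movsx eax,byte ptr [rcx+TestStruct.Val8]
    std::int32_t edx = ts.Val16;    // movsx edx,word ptr [rcx+TestStruct.Val16]
    eax += edx;                     // add eax,edx (always fits in 32 bits)
    std::int64_t rax = eax;         // movsxd rax,eax
    std::int64_t rdx = ts.Val32;    // movsxd rdx,[rcx+TestStruct.Val32]
    rax += rdx;                     // add rax,rdx
    rax += ts.Val64;                // add rax,[rcx+TestStruct.Val64]
    return rax;
}

// Same member values that main() assigns in Listing 3-5.
static_assert(SumWithExplicitWidening(TestStruct{-100, 0, 2000, -300000, 40000000000})
              == 39999701900, "matches the expected sum");
```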

Next, the function CalcTestStructSum_ uses an add eax,edx instruction to sum structure members 
TestStruct.Val8 and TestStruct.Val16. It then sign-extends this sum to 64 bits using a movsxd rax,eax 
instruction. The next instruction, movsxd rdx,[rcx+TestStruct.Val32], loads a sign-extended copy of 
TestStruct.Val32 into RDX, and the ensuing add rax,rdx instruction adds this value to the intermediate sum 
that's in RAX. The instruction add rax,[rcx+TestStruct.Val64] adds the value of structure member 
TestStruct.Val64 to the running sum in RAX, which generates the final result. The Visual C++ calling 
convention requires 64-bit return values to be placed in register RAX. Since the final result is already in the 
required register, no additional mov instructions are necessary. Here's the output for example Ch03_05. 


ts1.Val8 = -100 
ts1.Val16 = 2000 
ts1.Val32 = -300000 
ts1.Val16 = 40000000000 

sum1 = 39999701900 
sum2 = 39999701900 


Strings 


The x86-64 instruction set includes several useful instructions that process and manipulate strings. In x86 
parlance, a string is a contiguous sequence of bytes, words, doublewords, or quadwords. Programs can 
use the x86 string instructions to process conventional text strings such as “Hello, World.” They also can be 
employed to perform operations using the elements of an array or similarly-ordered data in memory. In this 
section, you'll examine some sample code that demonstrates how to use the x86-64 string instructions with 
text strings and integer arrays. 


Counting Characters 


Listing 3-6 shows the C++ and assembly language code for example Ch03_06. This example explains how to 
use the lodsb (Load String Byte) instruction to count the number of occurrences of a character in a text string. 


Listing 3-6. Example Ch03_06 


#include "stdafx.h" 
#include <iostream> 

using namespace std; 

extern "C" unsigned long long CountChars_(const char* s, char c); 

int main() 
{ 
    const char nl = '\n'; 
    const char* s0 = "Test string: "; 
    const char* s1 = " SearchChar: "; 
    const char* s2 = " Count: "; 

    char c; 
    const char* s; 

    s = "Four score and seven seconds ago, ..."; 
    cout << nl << s0 << s << nl; 

C m8 
cout << s1 << c << s2 << CountChars (s, c) << nl; 
Qm r5 
cout << s1 << c << s2 << CountChars (s, c) << nl; 
Caz s 
cout << s1 << c << s2 << CountChars (s, c) << nl; 
c= 'F'; 
cout << s1 << c << s2 << CountChars (s, c) << nl; 
e 


    cout << s1 << c << s2 << CountChars_(s, c) << nl; 

    s = "Red Green Blue Cyan Magenta Yellow"; 
    cout << nl << s0 << s << nl; 

    c = 'e'; 
    cout << s1 << c << s2 << CountChars_(s, c) << nl; 
    c = 'w'; 
    cout << s1 << c << s2 << CountChars_(s, c) << nl; 
ee “l3 
cout << s1 << c << s2 << CountChars (s, c) << nl; 
c= '0'; 
cout << s1 << c << s2 << CountChars (s, c) << nl; 
ccs 


cout << s1 << c << s2 << CountChars (s, c) << nl; 


    return 0; 
} 

; extern "C" unsigned long long CountChars_(const char* s, char c); 
; 
; Description:  This function counts the number of occurrences 
;               of a character in a string. 
; 
; Returns:      Number of occurrences found. 

        .code 
CountChars_ proc frame 

; Save non-volatile registers 
        push rsi                            ;save caller's rsi 
        .pushreg rsi 
        .endprolog 

; Load parameters and initialize count registers 
        mov rsi,rcx                         ;rsi = s 
        mov cl,dl                           ;cl = c 
        xor edx,edx                         ;rdx = Number of occurrences 
        xor r8d,r8d                         ;r8 = 0 (required for add below) 

; Repeat loop until the entire string has been scanned 
@@:     lodsb                               ;load next char into register al 
        or al,al                            ;test for end-of-string 
        jz @F                               ;jump if end-of-string found 
        cmp al,cl                           ;test current char 
        sete r8b                            ;r8b = 1 if match, 0 otherwise 
        add rdx,r8                          ;update occurrence count 
        jmp @B 

@@:     mov rax,rdx                         ;rax = number of occurrences 

; Restore non-volatile registers and return 
        pop rsi 
        ret 

CountChars_ endp 
        end 


The assembly language function CountChars_ accepts two arguments: a text string pointer s and a
search character c. Both are based on type char, which means that each text string character and the
search character require one byte of storage. The function CountChars_ starts with a function prolog that
saves the caller's RSI on the stack. It then loads the text string pointer s into RSI and the search character c
into register CL. An xor edx,edx instruction initializes register RDX to zero for use as a character occurrence
counter. The processing loop uses the lodsb instruction to read each text string character. This instruction
loads register AL with the contents of the memory pointed to by RSI; it then increments RSI by one so that it
points to the next character.

Next, the function CountChars_ uses an or al,al instruction to test for the end-of-string character
('\0'). This instruction sets the zero flag (RFLAGS.ZF) if register AL is equal to zero. If the end-of-string
character is not found, a cmp al,cl instruction compares the current text string character to the search
character. The subsequent sete r8b (Set Byte if Equal) instruction loads register R8B with a value of one
if a character match is found; otherwise R8B is set to zero. One important item that should be noted here is
that the sete instruction does not modify the upper 56 bits of register R8. Whenever the destination operand
of an instruction is an 8-bit or 16-bit register, the upper 56 or 48 bits of the corresponding 64-bit register
are unaffected by the specified operation. This is why the function zeroes R8 with an xor r8d,r8d instruction
before entering the loop. Following the sete instruction is an add rdx,r8 instruction
that updates the occurrence counter. This process is repeated until the end-of-string character is found.
Following completion of the text string scan, the final occurrence count is moved into register RAX and
returned to the caller. The output for example Ch03_06 is as follows:


Test string: Four score and seven seconds ago, ...
  SearchChar: s Count: 4
  SearchChar: o Count: 4
  SearchChar: z Count: 0
  SearchChar: F Count: 1
  SearchChar: . Count: 3

Test string: Red Green Blue Cyan Magenta Yellow
  SearchChar: e Count: 6
  SearchChar: w Count: 1
  SearchChar: l Count: 3
  SearchChar: Q Count: 0
  SearchChar: n Count: 3
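
The partial-register rule that CountChars_ relies on (a write to an 8-bit or 16-bit register leaves the upper bits alone, while a write to a 32-bit register zeroes the upper 32 bits) can be modeled in C++ with explicit masks. The following is only an illustrative sketch; the helper names WriteReg8, WriteReg16, and WriteReg32 are hypothetical and do not appear in the book's source code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers that model how a write to a sub-register
// changes the value of the full 64-bit register.

// 8-bit destination (e.g., sete r8b): upper 56 bits are preserved.
inline uint64_t WriteReg8(uint64_t reg, uint8_t value)
{
    return (reg & ~0xFFull) | value;
}

// 16-bit destination (e.g., mov r8w,ax): upper 48 bits are preserved.
inline uint64_t WriteReg16(uint64_t reg, uint16_t value)
{
    return (reg & ~0xFFFFull) | value;
}

// 32-bit destination (e.g., xor r8d,r8d): upper 32 bits are zeroed.
inline uint64_t WriteReg32(uint64_t /*reg*/, uint32_t value)
{
    return value;
}
```

Under this model, a stale value in the upper bits of R8 would survive every sete r8b and corrupt the add rdx,r8 result, which is consistent with the listing clearing R8 once before the loop.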


A version of CountChars_ that processes strings of type wchar_t (a 16-bit type in Visual C++) instead of char
can be easily created by changing the lodsb instruction to a lodsw (Load String Word) instruction. 16-bit
registers would also need to be used instead of 8-bit registers for the character matching instructions. The
last character of an x86 string instruction mnemonic indicates the size of the operand that is processed
(b = byte, w = word, d = doubleword, q = quadword).
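
For comparison, the counting loop can be written as a small C++ template that handles both char and wchar_t strings. This is a reference sketch rather than the book's code; the name CountCharsRef is an assumption made for illustration, and the comments map each step to the assembly language instruction it imitates.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reference model of CountChars_.
// CharT = char mirrors the lodsb version; CharT = wchar_t mirrors a
// lodsw version on platforms where wchar_t is 16 bits (e.g., Visual C++).
template <typename CharT>
unsigned long long CountCharsRef(const CharT* s, CharT c)
{
    assert(s != nullptr);
    unsigned long long count = 0;

    for (;;)
    {
        CharT ch = *s++;                // like lods: load, then advance pointer
        if (ch == 0)                    // like or al,al / jz
            break;
        count += (ch == c) ? 1 : 0;     // like cmp / sete / add
    }

    return count;
}
```

A call such as CountCharsRef(L"Red Green Blue", L'e') selects the wide-character version automatically, which is the C++ counterpart of switching from lodsb to lodsw.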
String Concatenation


The concatenation of two text strings is a common operation that is performed by many programs. C++
programs can use the library functions strcat, strcat_s, wcscat, and wcscat_s to concatenate two strings.
One limitation of these functions is that they can process only a single source string. Multiple calls are
necessary to concatenate several strings together. The next example, named Ch03_07, demonstrates how to
use the scas (Scan String) and movs (Move String) instructions to concatenate multiple strings. Listing 3-7
shows the C++ and x86-64 assembly language source code.


Listing 3-7. Example Ch03_07


#include "stdafx.h"
#include <iostream>
#include <string>
#include <cstring>

using namespace std;

extern "C" size_t ConcatStrings_(char* des, size_t des_size, const char* const* src, size_t src_n);

void PrintResult(const char* msg, const char* des, size_t des_len, const char* const* src, size_t src_n)
{
    string s_test;
    const char nl = '\n';

    cout << nl << "Test case: " << msg << nl;
    cout << "  Original Strings" << nl;

    for (size_t i = 0; i < src_n; i++)
    {
        const char* s1 = (strlen(src[i]) == 0) ? "<empty string>" : src[i];
        cout << "    i:" << i << " " << s1 << nl;
        s_test += src[i];
    }

    const char* s2 = (strlen(des) == 0) ? "<empty string>" : des;

    cout << "  Concatenated Result" << nl;
    cout << "    " << s2 << nl;

    if (s_test != des)
        cout << "  Error - test string compare failed" << nl;
}

int main()
{
    // Destination buffer size OK
    const char* src1[] = { "One ", "Two ", "Three ", "Four" };
    size_t src1_n = sizeof(src1) / sizeof(char*);
    const size_t des1_size = 64;
    char des1[des1_size];

    size_t des1_len = ConcatStrings_(des1, des1_size, src1, src1_n);
    PrintResult("destination buffer size OK", des1, des1_len, src1, src1_n);

    // Destination buffer too small
    const char* src2[] = { "Red ", "Green ", "Blue ", "Yellow " };
    size_t src2_n = sizeof(src2) / sizeof(char*);
    const size_t des2_size = 16;
    char des2[des2_size];

    size_t des2_len = ConcatStrings_(des2, des2_size, src2, src2_n);
    PrintResult("destination buffer too small", des2, des2_len, src2, src2_n);

    // Empty source string
    const char* src3[] = { "Plane ", "Car ", "", "Truck ", "Boat ", "Train ", "Bicycle " };
    size_t src3_n = sizeof(src3) / sizeof(char*);
    const size_t des3_size = 128;
    char des3[des3_size];

    size_t des3_len = ConcatStrings_(des3, des3_size, src3, src3_n);
    PrintResult("empty source string", des3, des3_len, src3, src3_n);

    // All strings empty
    const char* src4[] = { "", "", "", "" };
    size_t src4_n = sizeof(src4) / sizeof(char*);
    const size_t des4_size = 42;
    char des4[des4_size];

    size_t des4_len = ConcatStrings_(des4, des4_size, src4, src4_n);
    PrintResult("all strings empty", des4, des4_len, src4, src4_n);

    // Minimum des_size
    const char* src5[] = { "1", "22", "333", "4444" };
    size_t src5_n = sizeof(src5) / sizeof(char*);
    const size_t des5_size = 11;
    char des5[des5_size];

    size_t des5_len = ConcatStrings_(des5, des5_size, src5, src5_n);
    PrintResult("minimum des size", des5, des5_len, src5, src5_n);

    return 0;
}

; extern "C" size_t ConcatStrings_(char* des, size_t des_size, const char* const* src, size_t src_n);
;
; Returns:      -1      Invalid 'des_size'
;               n >= 0  Length of concatenated string

        .code
ConcatStrings_ proc frame

; Save non-volatile registers
        push rbx
        .pushreg rbx
        push rsi
        .pushreg rsi
        push rdi
        .pushreg rdi
        .endprolog

; Make sure des_size and src_n are valid
        mov rax,-1                          ;set error code
        test rdx,rdx                        ;test des_size
        jz InvalidArg                       ;jump if des_size is 0
        test r9,r9                          ;test src_n
        jz InvalidArg                       ;jump if src_n is 0

; Registers used in processing loop below
;   rbx = des               rdx = des_size
;   r8  = src               r9  = src_n
;   r10 = des_index         r11 = i
;   rcx = string length
;   rsi, rdi = pointers for scasb & movsb instructions

; Perform required initializations
        xor r10,r10                         ;des_index = 0
        xor r11,r11                         ;i = 0
        mov rbx,rcx                         ;rbx = des
        mov byte ptr [rbx],0                ;*des = '\0'

; Repeat loop until concatenation is finished
Loop1:  mov rax,r8                          ;rax = 'src'
        mov rdi,[rax+r11*8]                 ;rdi = src[i]
        mov rsi,rdi                         ;rsi = src[i]

; Compute length of src[i]
        xor eax,eax
        mov rcx,-1
        repne scasb                         ;find '\0'
        not rcx
        dec rcx                             ;rcx = len(src[i])

; Compute des_index + src_len
        mov rax,r10                         ;rax = des_index
        add rax,rcx                         ;des_index + len(src[i])
        cmp rax,rdx                         ;is des_index + src_len >= des_size?
        jge Done                            ;jump if des is too small

; Update des_index
        mov rax,r10                         ;des_index_old = des_index
        add r10,rcx                         ;des_index += len(src[i])

; Copy src[i] to &des[des_index] (rsi already contains src[i])
        inc rcx                             ;rcx = len(src[i]) + 1
        lea rdi,[rbx+rax]                   ;rdi = &des[des_index_old]
        rep movsb                           ;perform string move

; Update i and repeat if not done
        inc r11                             ;i += 1
        cmp r11,r9
        jl Loop1                            ;jump if i < src_n

; Return length of concatenated string
Done:   mov rax,r10                         ;rax = des_index (final length)

; Restore non-volatile registers and return
InvalidArg:
        pop rdi
        pop rsi
        pop rbx
        ret
ConcatStrings_ endp
        end


Let's begin by examining the C++ code in Listing 3-7. It starts with a declaration statement for the
assembly language function ConcatStrings_, which includes four parameters: des is the destination buffer
for the final string; des_size is the size of des in characters; src points to an array that contains
pointers to the source text strings; and src_n is the number of strings in that array. In 64-bit Visual C++
programs, the type size_t is equivalent to a 64-bit unsigned integer. The function ConcatStrings_ returns the
length of the final string in des, or -1 if the supplied value for des_size or src_n is zero.

The test cases presented in main illustrate the use of ConcatStrings_. If, for example, src points to a
text string array consisting of "Red", "Green", and "Blue", the final string in des is "RedGreenBlue" provided
des is large enough to contain the result. If des_size is insufficient, ConcatStrings_ produces a partially
concatenated string. For example, a des_size equal to 10 would yield "RedGreen" as the final string.
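
The behavior just described can be captured in a short C++ reference model, which is handy for cross-checking the assembly language version. This is a sketch under stated assumptions: the name ConcatStringsRef is hypothetical, and the model uses unsigned comparisons throughout, whereas the assembly language listing uses signed conditional jumps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical C++ model of the behavior described for ConcatStrings_.
// Returns (size_t)-1 for invalid arguments, else the final string length.
size_t ConcatStringsRef(char* des, size_t des_size,
                        const char* const* src, size_t src_n)
{
    if (des_size == 0 || src_n == 0)
        return static_cast<size_t>(-1);

    size_t des_index = 0;
    des[0] = '\0';

    for (size_t i = 0; i < src_n; i++)
    {
        size_t len = std::strlen(src[i]);

        // Stop if src[i] plus its terminator will not fit in des.
        if (des_index + len >= des_size)
            break;

        // Copy the characters and the terminating '\0'.
        std::memcpy(des + des_index, src[i], len + 1);
        des_index += len;
    }

    return des_index;
}
```

With the strings "Red", "Green", and "Blue", a 10-character buffer accepts "Red" (index 3) and "Green" (index 8) but rejects "Blue" because 8 + 4 is not less than 10.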

Following its prolog, the function ConcatStrings_ checks argument value des_size for validity
using a test rdx,rdx instruction. This instruction performs a bitwise AND of its two operands and sets
the parity (RFLAGS.PF), sign (RFLAGS.SF), and zero (RFLAGS.ZF) flags based on the result (the carry
(RFLAGS.CF) and overflow (RFLAGS.OF) flags are cleared). The result of the bitwise AND operation is
not saved. The test instruction is often used as an alternative to the cmp instruction, especially when a
function needs to ascertain if a value is less than, equal to, or greater than zero. Using a test instruction
may also be more efficient in terms of code space. In this instance, the test rdx,rdx instruction requires
fewer opcode bytes than a cmp rdx,0 instruction. The same check is then applied to src_n using a test r9,r9
instruction. Register initialization is carried out next prior to the
start of the concatenation processing loop.
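
The flag results of a test instruction can be illustrated with a small C++ model. The sketch below is an approximation written for this discussion; the Flags structure and the Test64 function are hypothetical names, and only the five status flags mentioned above are modeled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the status flags discussed in the text.
struct Flags { bool zf, sf, pf, cf, of; };

// Hypothetical model of "test a,b" for 64-bit operands.
inline Flags Test64(uint64_t a, uint64_t b)
{
    uint64_t r = a & b;                 // result is used for flags only

    // PF reflects the low byte only: set when it has an even number of 1 bits.
    unsigned ones = 0;
    for (unsigned bit = 0; bit < 8; bit++)
        ones += static_cast<unsigned>((r >> bit) & 1);

    Flags f;
    f.zf = (r == 0);                    // zero result
    f.sf = (r >> 63) != 0;              // copy of the most significant bit
    f.pf = (ones % 2) == 0;
    f.cf = false;                       // always cleared by test
    f.of = false;                       // always cleared by test
    return f;
}
```

Because test x,x leaves OF cleared, a following jz is taken only when x is zero, and a following jle (taken when ZF is set or SF differs from OF) is taken when x is zero or negative.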

The subsequent block of instructions marks the top of the concatenation loop that begins by loading
registers RSI and RDI with a pointer to string src[i]. The length of src[i] is determined next using a repne
scasb instruction in conjunction with several support instructions. The repne (Repeat String Operation
While not Equal) is an instruction prefix that repeats execution of a string instruction while the condition RCX
!= 0 && RFLAGS.ZF == 0 is true. The exact operation of the repne scasb (Scan String Byte) combination is
as follows: If RCX is not zero, the scasb instruction compares the string character pointed to by RDI to the
contents of register AL and sets the status flags according to the results. Register RDI is then automatically
incremented by one so that it points to the next character and a count of one is subtracted from RCX.
This string-processing operation is repeated as long as the aforementioned test conditions remain true;
otherwise, the repeat string operation terminates.
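
The step-by-step behavior of repne scasb can be mirrored in C++. The following is a minimal sketch that assumes RFLAGS.DF is 0 (forward scan); the ScanState structure and the RepneScasb function are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the registers that repne scasb updates.
struct ScanState
{
    const uint8_t* rdi;     // current scan position
    uint64_t rcx;           // remaining count
    bool zf;                // result of the last compare
};

// Hypothetical model of "repne scasb" with RFLAGS.DF = 0.
// If rcx is 0 on entry, no compare occurs; zf is reported as false here.
inline ScanState RepneScasb(const uint8_t* rdi, uint64_t rcx, uint8_t al)
{
    ScanState st { rdi, rcx, false };

    while (st.rcx != 0)
    {
        st.zf = (*st.rdi == al);    // scasb: compare [rdi] with al
        st.rdi++;                   // advance to the next byte
        st.rcx--;                   // count one iteration
        if (st.zf)                  // repne stops once a match is found
            break;
    }

    return st;
}
```

For the string "Red", scanning for zero with an initial count of all ones examines four bytes (three characters plus the terminator) and leaves the pointer one position past the terminator.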

Prior to use of the repne scasb instruction, register RCX was loaded with -1. Upon completion of
repne scasb, register RCX contains -(L + 2), where L denotes the actual length of string src[i]. The value L
is calculated using a not rcx (One's Complement Negation) instruction followed by a dec rcx (Decrement
by 1) instruction, which is equal to subtracting 2 from the two's complement negation of -(L + 2). It should
be noted that the instruction sequence used here to calculate the length of a text string is a well-known
technique that dates back to the 8086 CPU.
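
The arithmetic behind this length calculation can be written out step by step. The derivation below is a worked sketch that assumes the scan examines L + 1 bytes (the L characters of src[i] plus the terminating zero) and uses the two's complement identity not(x) = -x - 1.

```latex
\begin{aligned}
\mathrm{RCX}_{\text{initial}} &= -1 \\
\mathrm{RCX}_{\text{after scan}} &= -1 - (L + 1) = -(L + 2) \\
\operatorname{not}\bigl(-(L + 2)\bigr) &= (L + 2) - 1 = L + 1 \\
\operatorname{dec}(L + 1) &= L
\end{aligned}
```

For example, a three-character string gives RCX = -5 after the scan, not rcx yields 4, and dec rcx yields 3.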

Following the computation of len(src[i]), a check is made to verify that the string src[i] will fit
into the destination buffer. If the sum des_index + len(src[i]) is greater than or equal to des_size, the
function terminates. Otherwise, len(src[i]) is added to des_index and string src[i] is copied to the
correct position in des using a rep movsb (Repeat Move String Byte) instruction.

The rep movsb instruction copies the string pointed to by RSI to the memory location pointed to by RDI
using the length specified in RCX. An inc rcx instruction is executed before the string copy to ensure that
the end-of-string terminator '\0' is also transferred to des. Register RDI is initialized to the correct offset
in des using a lea rdi,[rbx+rax] (Load Effective Address) instruction, which computes the address of
the specified source operand (i.e., lea calculates RDI = RBX + RAX). The concatenation loop can use a lea
instruction since register RBX points to the start of des and RAX contains the value of des_index prior to
its addition with len(src[i]). Subsequent to the string copy operation, the value of i is updated and if it's
less than src_n, the concatenation loop is repeated. Following completion of the concatenation operation,
register RAX is loaded with des_index, which is the length of the final string in des. Here's the output of
example Ch03_07.


Test case: destination buffer size OK
  Original Strings
    i:0 One
    i:1 Two
    i:2 Three
    i:3 Four
  Concatenated Result
    One Two Three Four

Test case: destination buffer too small
  Original Strings
    i:0 Red
    i:1 Green
    i:2 Blue
    i:3 Yellow
  Concatenated Result
    Red Green Blue
  Error - test string compare failed

Test case: empty source string
  Original Strings
    i:0 Plane
    i:1 Car
    i:2 <empty string>
    i:3 Truck
    i:4 Boat
    i:5 Train
    i:6 Bicycle
  Concatenated Result
    Plane Car Truck Boat Train Bicycle

Test case: all strings empty
  Original Strings
    i:0 <empty string>
    i:1 <empty string>
    i:2 <empty string>
    i:3 <empty string>
  Concatenated Result
    <empty string>

Test case: minimum des size
  Original Strings
    i:0 1
    i:1 22
    i:2 333
    i:3 4444
  Concatenated Result
    1223334444


Comparing Arrays 
Besides text strings, the x86 string instructions also can be used to perform operations on other
contiguously-ordered data elements. The next source code example demonstrates how to use the cmps
(Compare String Operands) instruction to compare the elements of two arrays. Listing 3-8 contains the C++
and x86-64 assembly language source code for example Ch03_08.


Listing 3-8. Example Ch03_08


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>
#include <memory>

using namespace std;

extern "C" long long CompareArrays_(const int* x, const int* y, long long n);

void Init(int* x, int* y, long long n, unsigned int seed)
{
    uniform_int_distribution<> d {1, 10000};
    default_random_engine rng {seed};

    for (long long i = 0; i < n; i++)
        x[i] = y[i] = d(rng);
}

void PrintResult(const char* msg, long long result1, long long result2)
{
    cout << msg << '\n';
    cout << "  expected = " << result1;
    cout << "  actual = " << result2 << "\n\n";
}

int main()
{
    // Allocate and initialize the test arrays
    const long long n = 10000;
    unique_ptr<int[]> x_array {new int[n]};
    unique_ptr<int[]> y_array {new int[n]};
    int* x = x_array.get();
    int* y = y_array.get();

    Init(x, y, n, 11);

    cout << "Results for CompareArrays - array size = " << n << "\n\n";

    long long result;

    // Test using invalid array size
    result = CompareArrays_(x, y, -n);
    PrintResult("Test using invalid array size", -1, result);

    // Test using first element mismatch
    x[0] += 1;
    result = CompareArrays_(x, y, n);
    x[0] -= 1;
    PrintResult("Test using first element mismatch", 0, result);

    // Test using middle element mismatch
    y[n / 2] -= 2;
    result = CompareArrays_(x, y, n);
    y[n / 2] += 2;
    PrintResult("Test using middle element mismatch", n / 2, result);

    // Test using last element mismatch
    x[n - 1] *= 3;
    result = CompareArrays_(x, y, n);
    x[n - 1] /= 3;
    PrintResult("Test using last element mismatch", n - 1, result);

    // Test with identical elements in each array
    result = CompareArrays_(x, y, n);
    PrintResult("Test with identical elements in each array", n, result);

    return 0;
}


; extern "C" long long CompareArrays_(const int* x, const int* y, long long n)
;
; Returns       -1          Value of 'n' is invalid
;               0 <= i < n  Index of first non-matching element
;               n           All elements match

        .code
CompareArrays_ proc frame

; Save non-volatile registers
        push rsi
        .pushreg rsi
        push rdi
        .pushreg rdi
        .endprolog

; Load arguments and validate 'n'
        mov rax,-1                          ;rax = return code for invalid n
        test r8,r8
        jle @F                              ;jump if n <= 0

; Compare the arrays for equality
        mov rsi,rcx                         ;rsi = x
        mov rdi,rdx                         ;rdi = y
        mov rcx,r8                          ;rcx = n
        mov rax,r8                          ;rax = n
        repe cmpsd
        je @F                               ;arrays are equal

; Calculate index of first non-match
        sub rax,rcx                         ;rax = index of mismatch + 1
        dec rax                             ;rax = index of mismatch

; Restore non-volatile registers and return
@@:     pop rdi
        pop rsi
        ret
CompareArrays_ endp
        end


The assembly language function CompareArrays_ compares the elements of two integer arrays and
returns the index of the first non-matching element. If the arrays are identical, the number of elements
is returned. If argument n is invalid, -1 is returned to indicate an error. Following the function prolog,
a test r8,r8 instruction checks argument value n to see if it's less than or equal to zero. As you learned in
the previous section, this instruction performs a bitwise AND of the two operands and sets the status flags
RFLAGS.PF, RFLAGS.SF, and RFLAGS.ZF based on the result (RFLAGS.CF and RFLAGS.OF are cleared). The result
of the AND operation is discarded. If argument value n is invalid, the jle @F instruction skips over the compare
code.

The actual compare code begins by loading register RSI with a pointer to x and RDI with a pointer to y.
The number of elements is then loaded into registers RCX and RAX. The arrays are compared using a repe cmpsd
(Compare String Doubleword) instruction. This instruction compares the two doublewords pointed to by
RSI and RDI and sets the status flags according to the results. Registers RSI and RDI are incremented by four
after each compare operation (the value 4 is used since that's the size of a doubleword in bytes). The repe
(Repeat While Equal) prefix instructs the processor to repeat the cmpsd instruction as long as the condition
RCX != 0 && RFLAGS.ZF == 1 is true. Upon completion of the repe cmpsd instruction, a conditional jump is
performed if the arrays are equal (RAX already contains the correct return value); otherwise, the index of the
first non-matching element is calculated as n - RCX - 1. Here's the output for example Ch03_08.


Results for CompareArrays - array size = 10000 


Test using invalid array size 
expected = -1 actual = -1 


Test using first element mismatch 
expected = 0 actual = 0


Test using middle element mismatch 
expected = 5000 actual = 5000 


Test using last element mismatch 
expected = 9999 actual = 9999 


Test with identical elements in each array 
expected = 10000 actual = 10000 
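
The compare loop and the mismatch index calculation can be restated in C++ as a cross-check. This is a sketch, not the book's code: the name CompareArraysRef is hypothetical, and the local variables rcx and zf stand in for register RCX and RFLAGS.ZF.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical C++ model of CompareArrays_.
long long CompareArraysRef(const int32_t* x, const int32_t* y, long long n)
{
    if (n <= 0)
        return -1;                      // invalid n

    long long rcx = n;                  // remaining element count
    bool zf = true;                     // result of the last compare

    // repe cmpsd: compare, advance both pointers, count down;
    // repeat while rcx != 0 and the last compare was equal.
    while (rcx != 0)
    {
        zf = (*x == *y);
        x++;
        y++;
        rcx--;
        if (!zf)
            break;
    }

    if (zf)
        return n;                       // all elements match

    return n - rcx - 1;                 // index of first mismatch
}
```

The subtraction of one is needed because the count is decremented for the mismatching element too; a mismatch at the last element leaves rcx at 0 and yields n - 1.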


Array Reversal 


The final source code example of this section demonstrates use of the lods (Load String) instruction to
reverse the elements of an array. Unlike this section's previous source code examples, the processing loop
of example Ch03_09 traverses the source array starting from the last element and ending at the first element.
Executing a reverse array traversal requires the direction flag (RFLAGS.DF) to be modified in a manner that
is compatible with the Visual C++ runtime environment as elucidated in this example. Listing 3-9 shows the
C++ and x86-64 assembly language source code for example Ch03_09.
Listing 3-9. Example Ch03_09


//------------------------------------------------
//               Ch03_09.cpp
//------------------------------------------------

#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>

using namespace std;

extern "C" int ReverseArray_(int* y, const int* x, int n);

void Init(int* x, int n)
{
    unsigned int seed = 17;
    uniform_int_distribution<> d {1, 1000};
    default_random_engine rng {seed};

    for (int i = 0; i < n; i++)
        x[i] = d(rng);
}

int main()
{
    const int n = 25;
    int x[n], y[n];

    Init(x, n);

    int rc = ReverseArray_(y, x, n);

    if (rc != 0)
    {
        cout << "\nResults for ReverseArray\n";

        const int w = 5;
        bool compare_error = false;

        for (int i = 0; i < n && !compare_error; i++)
        {
            cout << "  i: " << setw(w) << i;
            cout << "  y: " << setw(w) << y[i];
            cout << "  x: " << setw(w) << x[i] << '\n';

            if (x[i] != y[n - 1 - i])
                compare_error = true;
        }

        if (compare_error)
            cout << "ReverseArray compare error\n";
        else
            cout << "ReverseArray compare OK\n";
    }
    else
        cout << "ReverseArray_() failed\n";

    return 0;
}

;------------------------------------------------
;               Ch03_09.asm
;------------------------------------------------

; extern "C" int ReverseArray_(int* y, const int* x, int n);
;
; Returns       0 = invalid n, 1 = success

        .code
ReverseArray_ proc frame

; Save non-volatile registers
        push rsi
        .pushreg rsi
        push rdi
        .pushreg rdi
        .endprolog

; Make sure n is valid
        xor eax,eax                         ;error return code
        test r8d,r8d                        ;is n <= 0?
        jle InvalidArg                      ;jump if n <= 0

; Initialize registers for reversal operation
        mov rsi,rdx                         ;rsi = x
        mov rdi,rcx                         ;rdi = y
        mov ecx,r8d                         ;rcx = n
        lea rsi,[rsi+rcx*4-4]               ;rsi = &x[n - 1]

; Save caller's RFLAGS.DF, then set RFLAGS.DF to 1
        pushfq                              ;save caller's RFLAGS.DF
        std                                 ;RFLAGS.DF = 1

; Repeat loop until array reversal is complete
@@:     lodsd                               ;eax = *x--
        mov [rdi],eax                       ;*y = eax
        add rdi,4                           ;y++
        dec rcx                             ;n--
        jnz @B

; Restore caller's RFLAGS.DF and set return code
        popfq                               ;restore caller's RFLAGS.DF
        mov eax,1                           ;set success return code

; Restore non-volatile registers and return
InvalidArg:
        pop rdi
        pop rsi
        ret
ReverseArray_ endp
        end


The function ReverseArray_ copies the elements of a source array to a destination array in reverse
order. This function requires three parameters: a pointer to a destination array named y, a pointer to a
source array named x, and the number of elements n. Following validation of n, registers RSI and RDI are
initialized with pointers to the arrays x and y. A mov ecx,r8d instruction loads the number of elements
into register RCX. In order to reverse the elements of the source array, the address of the last array element
x[n - 1] needs to be calculated. This is accomplished using a lea rsi,[rsi+rcx*4-4] instruction, which
computes the effective address of the source memory operand (i.e., it performs the arithmetic operation
specified between the brackets and saves the result to register RSI).
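
The effective address computed by this lea instruction is ordinary pointer arithmetic, as the following C++ sketch shows. The function name LastElementAddress is hypothetical, and the element size of four bytes matches the 32-bit integers used in this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: mirrors lea rsi,[rsi+rcx*4-4] for 4-byte elements.
// base is the address of x[0]; n is the number of elements (n > 0).
inline uintptr_t LastElementAddress(uintptr_t base, uint64_t n)
{
    assert(n > 0);
    return base + static_cast<uintptr_t>(n) * 4 - 4;
}
```

For an array of 25 integers this gives the address of x[0] plus 96 bytes, which is the address of x[24].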

The Visual C++ runtime environment assumes that the direction flag (RFLAGS.DF) is always cleared.
If an assembly language function sets RFLAGS.DF to perform auto-decrementing with a string instruction,
the flag must be cleared before returning to the caller or using any library functions. The function
ReverseArray_ partially fulfills this requirement by saving the current state of RFLAGS.DF on the stack
using the pushfq (Push RFLAGS Register onto Stack) instruction. It then uses the std (Set Direction Flag)
instruction to set RFLAGS.DF to 1. The duplication of array elements from x to y is straightforward. A lodsd
(Load String Doubleword) instruction loads an element from x into EAX and subtracts four from register
RSI. The next instruction, mov [rdi],eax, saves this value to the element in y that is pointed to by RDI. An
add rdi,4 instruction points RDI to the next element in y. Register RCX is then decremented and the loop is
repeated until the array reversal is complete.
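
The reversal loop can be restated in C++ to make the two opposing traversal directions explicit. This is an illustrative sketch; the name ReverseArrayRef is an assumption, and array indices are used in place of the RSI and RDI pointers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical C++ model of ReverseArray_; returns 0 for invalid n, else 1.
int ReverseArrayRef(int32_t* y, const int32_t* x, int n)
{
    if (n <= 0)
        return 0;

    long long src_index = n - 1;        // like lea rsi,[rsi+rcx*4-4]
    long long dst_index = 0;

    for (int count = n; count != 0; count--)
    {
        int32_t value = x[src_index];   // lodsd: load the element ...
        src_index--;                    // ... DF = 1 moves the source backward
        y[dst_index] = value;           // mov [rdi],eax
        dst_index++;                    // add rdi,4 moves the destination forward
    }

    return 1;
}
```

Only the source side depends on the direction flag; the destination pointer is advanced by an explicit add, so it moves forward regardless of RFLAGS.DF.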

Following the reverse array loop, a popfq (Pop Stack into RFLAGS Register) instruction restores the
original state of RFLAGS.DF. One question that might be asked at this point is if the Visual C++ runtime
environment assumes that RFLAGS.DF is always cleared, why doesn't the function ReverseArray_ use a
cld (Clear Direction Flag) instruction to restore RFLAGS.DF instead of a pushfq/popfq sequence? Yes, the
Visual C++ runtime environment assumes that RFLAGS.DF is always cleared, but it cannot enforce this policy
during program execution. If ReverseArray_ were to be included in a DLL, it could conceivably be called by a
function written in a language that uses a different default state for the direction flag. Using pushfq and popfq
ensures that the state of the caller's direction flag is always properly restored. Here is the output for
example Ch03_09.


Results for ReverseArray
  i:     0  y:   583  x:   560
  i:     1  y:   904  x:   586
  i:     2  y:   924  x:   752
  i:     3  y:   635  x:   743
  i:     4  y:   347  x:   511
  i:     5  y:   313  x:   370
  i:     6  y:   738  x:   809
  i:     7  y:   810  x:   214
  i:     8  y:   935  x:   823
  i:     9  y:   354  x:   456
  i:    10  y:   592  x:    13
  i:    11  y:   613  x:   240
  i:    12  y:   413  x:   413
  i:    13  y:   240  x:   613
  i:    14  y:    13  x:   592
  i:    15  y:   456  x:   354
  i:    16  y:   823  x:   935
  i:    17  y:   214  x:   810
  i:    18  y:   809  x:   738
  i:    19  y:   370  x:   313
  i:    20  y:   511  x:   347
  i:    21  y:   743  x:   635
  i:    22  y:   752  x:   924
  i:    23  y:   586  x:   904
  i:    24  y:   560  x:   583
ReverseArray compare OK


Summary 


Here are the key learning points for Chapter 3: 
The address of an element in a one-dimensional array can be calculated using the
base address (i.e., the address of the first element) of the array, the index of the 
element, and the size in bytes of each element. The address of an element in a two- 
dimensional array can be calculated using the base address of the array, the row and 
column indices, the number of columns, and the size in bytes of each element. 


The Visual C++ calling convention designates each general-purpose register as volatile or 
non-volatile. A function must preserve the contents of any non-volatile general-purpose 
register it uses. A function should use the push instruction in its prolog to save the 
contents of a non-volatile register on the stack. A function should use the pop instruction 
in its epilog to restore the contents of any previously-saved non-volatile register. 


X86-64 assembly language code can define and use structures similar to the way 
they are used in C++. An assembly language structure may require extra padding 
elements to ensure that it's semantically equivalent to a C++ structure. 


The upper 32 bits of a 64-bit general-purpose register are set to zero in instructions 
that specify the corresponding 32-bit register as a destination operand. The upper 56 
or 48 bits of a 64-bit general-purpose register are not affected when the destination 
operand of an instruction is an 8-bit or 16-bit register. 


The x86 string instructions cmps, lods, movs, scas, and stos can be used to 
compare, load, copy, scan, or initialize text strings. They also can be used to perform 
operations on arrays and other similarly-ordered data structures. 


The prefixes rep, repe, repz, repne, and repnz can be used with a string instruction 
to repeat a string operation multiple times (RCX contains the count value) or until 
the specified zero flag (RFLAGS.ZF) condition occurs. 


The state of the direction flag (RFLAGS.DF) must be preserved across function 
boundaries. 


The test instruction is often used as an alternative to the cmp instruction, especially 
when testing a value to ascertain if it's less than, equal to, or greater than zero. 


The lea instruction can be used to simplify effective address calculations. 


CHAPTER 4 


Advanced Vector Extensions 


In the first three chapters of this book, you learned about the core x86-64 platform including its data types, 
general-purpose registers, and memory addressing modes. You also examined a cornucopia of sample code 
that illustrated the fundamentals of x86-64 assembly language programming, including basic operands, 
integer arithmetic, compare operations, conditional jumps, and manipulation of common data structures. 

This chapter introduces Advanced Vector Extensions (AVX). It begins with a brief overview of AVX 
technologies and SIMD (Single Instruction Multiple Data) processing concepts. This is followed by an 
examination of the AVX execution environment that covers register sets, data types, and instruction syntax. 
The chapter also includes discussions of AVX's scalar floating-point capabilities and its SIMD computational 
resources. The material presented in this chapter is relevant not only to AVX but also provides the necessary 
background information to understand AVX2 and AVX-512, which are explained in later chapters. 

In the discussions that follow in this and subsequent chapters, the term x86-AVX is used to describe
general characteristics and computing resources of Advanced Vector Extensions. The acronyms AVX, AVX2,
and AVX-512 are employed when examining attributes or instructions related to a specific x86 feature set
enhancement.


AVX Overview 


AMD and Intel first incorporated AVX into their CPUs starting in 2011. AVX extends the packed single- 
precision and double-precision floating-point capabilities of x86-SSE from 128 bits to 256 bits. Unlike 
general-purpose register instructions, AVX instructions use a three-operand syntax that employs non- 
destructive source operands, which simplifies assembly language programming considerably. Programmers 
can use this new instruction syntax with packed 128-bit integer, packed 128-bit floating-point, and packed 
256-bit floating-point operands. The three-operand instruction syntax can also be exploited to perform 
scalar single-precision and double-precision floating-point arithmetic. 

In 2013 Intel launched processors that included AVX2. This architectural enhancement extends the 
packed integer capabilities of AVX from 128 bits to 256 bits. AVX2 adds new data broadcast, blend, and 
permute instructions to the x86 platform. It also introduces a new vector-index addressing mode that 
facilitates memory loads (or gathers) of data elements from non-contiguous locations. The most recent 
x86-AVX extension is called AVX-512, which expands the SIMD capabilities of AVX and AVX2 from 256 bits 
to 512 bits. AVX-512 also adds eight new opmask registers named K0-K7 to the x86 platform. These registers 
facilitate conditional instruction execution and data merging operations using per-element granularity. 
Table 4-1 summarizes current x86-AVX technologies. In this table (and subsequent tables), the acronyms 
SPFP and DPFP are used to signify single-precision floating-point and double-precision floating-point, 
respectively. 
Table 4-1. Summary of x86-AVX Technologies


Feature                                                            AVX   AVX2  AVX-512
Three-operand syntax; non-destructive source operands              Yes   Yes   Yes
SIMD operations using 128-bit packed integers                      Yes   Yes   Yes
SIMD operations using 256-bit packed integers                      No    Yes   Yes
SIMD operations using 512-bit packed integers                      No    No    Yes
SIMD operations using 128-bit packed SPFP, DPFP                    Yes   Yes   Yes
SIMD operations using 256-bit packed SPFP, DPFP                    Yes   Yes   Yes
SIMD operations using 512-bit packed SPFP, DPFP                    No    No    Yes
Scalar SPFP, DPFP arithmetic                                       Yes   Yes   Yes
Enhanced SPFP, DPFP compare operations                             Yes   Yes   Yes
Basic SPFP, DPFP broadcast and permute                             Yes   Yes   Yes
Enhanced SPFP, DPFP broadcast and permute                          No    Yes   Yes
Packed integer broadcast                                           No    Yes   Yes
Enhanced packed integer broadcast, compare, permute, conversions   No    No    Yes
Instruction-level broadcast and rounding control                   No    No    Yes
Fused-multiply-add                                                 No    Yes   Yes
Data gather                                                        No    Yes   Yes
Data scatter                                                       No    No    Yes
Conditional execution and data merging using opmask registers      No    No    Yes


Note that fused-multiply-add is a distinct x86 platform feature extension that was 
introduced in tandem with AVX2. A program must confirm the presence of this feature extension by testing 
the CPUID FMA feature flag before using any of the corresponding instructions. You'll learn how to do 
this in Chapter 16. The remainder of this chapter focuses primarily on AVX. Chapters 8 and 12 discuss the 
particulars of AVX2 and AVX-512 in greater detail. 


SIMD Programming Concepts 


As implied by the words of the acronym, a SIMD computing element executes the same operation on 
multiple data items simultaneously. Universal SIMD operations include basic arithmetic such as addition, 
subtraction, multiplication, and division. SIMD processing techniques can also be applied to a variety of 
other computational tasks including data compares, conversions, Boolean calculations, permutations, 
and bit shifts. Processors facilitate SIMD operations by reinterpreting the bits of an operand in a register 
or memory location. For example, a 128-bit wide operand can hold two independent 64-bit integer values. 
It is also capable of accommodating four 32-bit integers, eight 16-bit integers, or sixteen 8-bit integers, as 
illustrated in Figure 4-1. 


Figure 4-1. 128-bit wide operand using distinct integers 


Figure 4-2 exemplifies a few SIMD arithmetic operations in greater detail. In this figure, integer addition 
is illustrated using two 64-bit integers, four 32-bit integers, or eight 16-bit integers. Faster algorithmic 
processing takes place when multiple data items are exercised, since the CPU can carry out the necessary 
operations in parallel. For example, when 16-bit integer operands are specified by an instruction, the CPU 
performs all eight 16-bit integer additions simultaneously. 

Figure 4-2. SIMD integer addition 
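
The lane behavior described above can be modeled in portable C++ without any SIMD hardware. The following is a minimal sketch, not code from this book: the Bits128 type and the packed_add helper are hypothetical names introduced for illustration, and the byte layout assumes a little-endian machine such as x86. It shows that the same 128 bits produce different sums depending on the lane width, because a carry never crosses a lane boundary.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 128-bit operand: 16 raw bytes.
using Bits128 = std::array<std::uint8_t, 16>;

// Adds two 128-bit operands lane by lane, where Lane is an unsigned
// integer type (uint8_t, uint16_t, uint32_t, or uint64_t).
template <typename Lane>
Bits128 packed_add(const Bits128& a, const Bits128& b)
{
    constexpr std::size_t num_lanes = 16 / sizeof(Lane);
    Lane la[num_lanes];
    Lane lb[num_lanes];
    Lane lr[num_lanes];

    // Reinterpret the same 128 bits as an array of lanes.
    std::memcpy(la, a.data(), 16);
    std::memcpy(lb, b.data(), 16);

    for (std::size_t i = 0; i < num_lanes; ++i) {
        // Wraparound addition; any carry out of a lane is discarded.
        lr[i] = static_cast<Lane>(la[i] + lb[i]);
    }

    Bits128 result;
    std::memcpy(result.data(), lr, 16);
    return result;
}
```

With a set to sixteen 0xFF bytes and b holding the value 1 in its lowest byte, packed_add<std::uint8_t> clears only byte 0, whereas packed_add<std::uint64_t> clears the entire low-order quadword and leaves the high-order quadword untouched.
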


Wraparound vs. Saturated Arithmetic 


One extremely useful feature of x86-AVX technology is its support for saturated integer arithmetic. In 
saturated integer arithmetic, computational results are automatically clipped by the processor to prevent 
overflow and underflow conditions. This differs from normal wraparound integer arithmetic where an 
overflow or underflow result is retained (as you'll soon see). Saturated arithmetic is handy when working 
with pixel values since it automatically clips values and eliminates the need to explicitly check the result of 
each pixel calculation for an overflow or underflow condition. X86-AVX includes instructions that perform 
saturated arithmetic using 8-bit and 16-bit integers, both signed and unsigned. 

Let's take a closer look at some examples of both wraparound and saturated arithmetic. Figure 4-3 
shows an example of 16-bit signed integer addition using wraparound and saturated arithmetic. An overflow 
condition occurs if the two 16-bit signed integers are added using wraparound arithmetic. With saturated 
arithmetic, however, the result is clipped to the largest possible 16-bit signed integer value. Figure 4-4 
illustrates a similar example using 8-bit unsigned integers. Besides addition, x86-AVX also supports saturated 
integer subtraction, as shown in Figure 4-5. Table 4-2 summarizes the saturated arithmetic range limits for all 
supported integer sizes and sign types. 


16-bit Signed Integer Addition 


      Wraparound              Saturated 
      20000 (0x4e20)          20000 (0x4e20) 
    + 15000 (0x3a98)        + 15000 (0x3a98) 
     -30536 (0x88b8)          32767 (0x7fff) 


Figure 4-3. 16-bit signed integer addition using wraparound and saturated arithmetic 


8-bit Unsigned Integer Addition 


      Wraparound          Saturated 
      150 (0x96)          150 (0x96) 
    + 135 (0x87)        + 135 (0x87) 
       29 (0x1d)          255 (0xff) 


Figure 4-4. 8-bit unsigned integer addition using wraparound and saturated arithmetic 


16-bit Signed Integer Subtraction 


Wraparound Saturated 


Figure 4-5. 16-bit signed integer subtraction using wraparound and saturated arithmetic 


Table 4-2. Range Limits for Saturated Arithmetic 


Integer Type       Lower Limit        Upper Limit 
8-bit signed       -128 (0x80)        +127 (0x7f) 
8-bit unsigned     0                  +255 (0xff) 
16-bit signed      -32768 (0x8000)    +32767 (0x7fff) 
16-bit unsigned    0                  +65535 (0xffff) 
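
The difference between the two arithmetic modes can be sketched with scalar C++ helpers. This is an illustrative model only; the function names below are hypothetical, and each call processes a single element, whereas an x86-AVX instruction applies the same rule to every element of a packed operand. The values match the addition examples in Figures 4-3 and 4-4 and the limits in Table 4-2.

```cpp
#include <algorithm>
#include <cstdint>

// Wraparound 16-bit signed addition: only the low-order 16 bits are kept.
// The final conversion relies on two's complement behavior, which is
// guaranteed since C++20 and universal in practice before that.
std::int16_t add_wrap_i16(std::int16_t a, std::int16_t b)
{
    std::uint16_t sum = static_cast<std::uint16_t>(
        static_cast<std::uint16_t>(a) + static_cast<std::uint16_t>(b));
    return static_cast<std::int16_t>(sum);
}

// Saturated 16-bit signed addition: clip to [-32768, +32767].
std::int16_t add_sat_i16(std::int16_t a, std::int16_t b)
{
    std::int32_t sum = static_cast<std::int32_t>(a) + static_cast<std::int32_t>(b);
    sum = std::max<std::int32_t>(-32768, std::min<std::int32_t>(32767, sum));
    return static_cast<std::int16_t>(sum);
}

// Saturated 16-bit signed subtraction: clip to [-32768, +32767].
std::int16_t sub_sat_i16(std::int16_t a, std::int16_t b)
{
    std::int32_t diff = static_cast<std::int32_t>(a) - static_cast<std::int32_t>(b);
    diff = std::max<std::int32_t>(-32768, std::min<std::int32_t>(32767, diff));
    return static_cast<std::int16_t>(diff);
}

// Wraparound 8-bit unsigned addition: only the low-order 8 bits are kept.
std::uint8_t add_wrap_u8(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a + b);
}

// Saturated 8-bit unsigned addition: clip to [0, 255].
std::uint8_t add_sat_u8(std::uint8_t a, std::uint8_t b)
{
    unsigned sum = static_cast<unsigned>(a) + static_cast<unsigned>(b);
    return static_cast<std::uint8_t>(std::min(sum, 255u));
}
```

For the pixel-processing case mentioned earlier, add_sat_u8 is the relevant model: brightening a pixel whose value is 150 by 135 yields 255 instead of wrapping around to 29.
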


AVX Execution Environment 


In this section you'll learn about the x86-AVX execution environment. Included are explanations of the 
AVX register set, its data types, and instruction syntax. As mentioned earlier, x86-AVX is an architectural 
enhancement that extends x86-SSE technology to support SIMD operations using either 256-bit or 128-bit 
wide operands. The material that’s presented in this section assumes no previous knowledge or experience 
with x86-SSE. 


Register Set 


X86-64 processors that support AVX incorporate 16 256-bit wide registers named YMM0 - YMM15. The 
low-order 128 bits of each YMM register are aliased to a corresponding XMM register, as illustrated in 
Figure 4-6. Most AVX instructions can use any of the XMM or YMM registers as SIMD operands. The XMM 
registers can also be employed to carry out scalar floating-point calculations using either single-precision 
or double-precision values similar to x86-SSE. Programmers with assembly language experience using 
x86-SSE need to be aware of some minor execution differences between this earlier instruction set 
extension and x86-AVX. These differences are explained later in this chapter. 


Figure 4-6. AVX register set 


The x86-AVX execution environment also includes a control-status register named MXCSR. This register 
contains status flags that facilitate the detection of error conditions caused by floating-point arithmetic 
operations. It also includes control bits that programs can use to enable or disable floating-point exceptions 
and specify rounding options. You'll learn more about the MXCSR register later in this chapter. 


Data Types 


As previously mentioned, AVX supports SIMD operations using 256-bit and 128-bit wide packed single- 
precision or packed double-precision floating-point operands. A 256-bit wide YMM register or memory 
location can hold eight single-precision or four double-precision values, as shown in Figure 4-7. When used 
with a 128-bit wide XMM register or memory location, an AVX instruction can process four single-precision 
or two double-precision values. Like SSE and SSE2, AVX instructions use the low-order doubleword or 
quadword of an XMM register to carry out scalar single-precision or double-precision floating-point 
arithmetic. 

[Figure: data type layouts, by bit position and byte position, within a YMM register or 256-bit memory 
operand and within an XMM register or 128-bit memory operand: packed bytes, packed words, packed 
doublewords, scalar SPFP, scalar DPFP, packed SPFP, and packed DPFP] 

Figure 4-7. AVX and AVX2 data types 


AVX also includes instructions that use the XMM registers to perform SIMD operations using a 
variety of packed integer operands including bytes, words, doublewords, and quadwords. AVX2 extends 
the packed integer processing capabilities of AVX to the YMM registers and 256-bit wide operands in 
memory. Figure 4-7 also shows these data types. 


Instruction Syntax 


Perhaps the most noteworthy programming facet of x86-AVX is its use of a contemporary assembly 
language instruction syntax. Most x86-AVX instructions use a three-operand format that consists of 
two source operands and one destination operand. The general syntax that's employed for x86-AVX 
instructions is InstrMnemonic DesOp,SrcOp1,SrcOp2. Here, InstrMnemonic signifies the instruction 
mnemonic, DesOp represents the destination operand, and SrcOp1 and SrcOp2 denote the source 
operands. A small subset of x86-AVX instructions employs one or three source operands along with a 
destination operand. Nearly all x86-AVX instruction source operands are non-destructive. This means 
source operands are not modified during instruction execution, except in cases where the destination 
operand register is the same as one of the source operand registers. The use of non-destructive source 
operands often results in simpler and slightly faster code since the number of register-to-register data 
transfers that a function must perform is reduced. 

X86-AVX's ability to support a three-operand instruction syntax is due to a new instruction-encoding 
prefix. The vector extension (VEX) prefix enables x86-AVX instructions to be encoded using a more efficient 
format than the prefixes used for x86-SSE instructions. The VEX prefix has also been used to add new 
general-purpose register instructions to the x86 platform. You'll learn about these instructions in Chapter 8. 


AVX Scalar Floating-Point 


This section examines the scalar floating-point capabilities of AVX. It begins with a short explanation of 
some important floating-point concepts including data types, bit encodings, and special values. Software 
developers who understand these concepts are often able to improve the performance of algorithms that 
make heavy use of floating-point arithmetic and minimize potential floating-point errors. The AVX scalar 
floating-point registers are also explained in this section, including descriptions of the XMM registers 
and the MXCSR control-status register. The section concludes with an overview of the AVX scalar 
floating-point instruction set. 


Floating-Point Programming Concepts 


In mathematics a real-number system depicts an infinite continuum of all possible positive and negative 
numbers including integers, rational numbers, and irrational numbers. Given their finite resources, modern 
computing architectures typically employ a floating-point system to approximate a real-number system. 
Like many other computing platforms, the x86's floating-point system is based on the IEEE 754 standard 
for binary floating-point arithmetic. This standard includes specifications that define bit encodings, range 
limits, and precisions for scalar floating-point values. The IEEE 754 standard also specifies important details 
related to floating-point arithmetic operations, rounding rules, and numerical exceptions. 

The AVX instruction set supports common floating-point operations using single precision (32-bit) and 
double precision (64-bit) values. Many C++ compilers including Visual C++ use the x86's intrinsic single- 
precision and double-precision types to implement the C++ types float and double. Figure 4-8 illustrates 
the memory organization of both single-precision and double-precision floating-point values. This figure 
also includes common integer types for comparison purposes. 

Figure 4-8. Memory organization of floating-point values 

The binary encoding of a floating-point value requires three distinct fields: a significand, an exponent, 
and a sign bit. The significand field represents a number's significant digits (or fractional part). The exponent 
specifies the location of the binary "decimal" point in the significand, which determines the magnitude. The 
sign bit indicates whether the number is positive (s = 0) or negative (s = 1). Table 4-3 lists the various size 
parameters that are used to encode single-precision and double-precision floating-point values. 


Table 4-3. Floating-Point Size Parameters 


Parameter            Single-Precision    Double-Precision 
Total width          32 bits             64 bits 
Significand width    23 bits             52 bits 
Exponent width       8 bits              11 bits 
Sign width           1 bit               1 bit 
Exponent bias        +127                +1023 


Figure 4-9 illustrates how to convert a decimal number into an x86 compatible floating-point encoded 
value. In this example, the number 237.8125 is transformed from a decimal number to its single-precision 
floating-point encoding. The process starts by converting the number from base 10 to base 2. Next, the 
base 2 value is transformed to a binary scientific value. The value to the right of the E₂ symbol is the binary 
exponent. A properly encoded floating-point value uses a biased exponent instead of the true exponent 
since this expedites floating-point compare operations. For a single-precision floating-point number, the 
bias value is +127. Adding the exponent bias value to the true exponent generates a binary scientific number 
with a biased exponent value. In the example that's shown in Figure 4-9, adding 111b (+7) to 1111111b 
(+127) yields a binary scientific value with a biased exponent of 10000110b (+134). 


237.8125                                 Base 10 
11101101.1101                            Base 2 
1.11011011101 E₂ 111                     Binary Scientific 
1.11011011101 E₂ 10000110                Binary Scientific with Biased Exponent 
0 10000110 11011011101000000000000       Sign Bit, Biased Exponent, Normalized Significand 
0x436DD000                               Final Single-Precision Encoding 


Figure 4-9. Single-precision floating-point encoding process 

When encoding a single-precision or double-precision floating-point value, the leading 1 digit of the 
significand is implied and not included in the final binary representation. Dropping the leading 1 digit forms 
a normalized significand. The three fields required for an IEEE 754 compliant encoding are now available, 
as shown in Table 4-4. A reading of the bit fields in this table from left to right yields the 32-bit value 
0x436DD000, which is the final single-precision floating-point encoding of 237.8125. 


Table 4-4. Bit Fields for IEEE 754 Compliant Encoding of 237.8125 


Sign    Biased Exponent    Normalized Significand 
0       10000110           11011011101000000000000 
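
The encoding of 237.8125 can be checked by extracting the three bit fields from a C++ float. The sketch below is illustrative and assumes that float uses the 32-bit IEEE 754 single-precision format, which is the case for the compilers discussed in this book; the FloatFields structure and both function names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// The three fields of a single-precision encoding (see Table 4-3).
struct FloatFields {
    std::uint32_t sign;         // 1 bit
    std::uint32_t biased_exp;   // 8 bits, bias is +127
    std::uint32_t significand;  // 23 bits, leading 1 digit is implied
};

// Returns the raw 32-bit encoding of a single-precision value.
std::uint32_t float_bits(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    return bits;
}

// Splits the raw encoding into sign, biased exponent, and significand.
FloatFields decode(float f)
{
    const std::uint32_t bits = float_bits(f);
    FloatFields fields;
    fields.sign = bits >> 31;
    fields.biased_exp = (bits >> 23) & 0xFFu;
    fields.significand = bits & 0x7FFFFFu;
    return fields;
}
```

Calling float_bits(237.8125f) returns 0x436DD000, and decode(237.8125f) yields a sign of 0, a biased exponent of 134 (true exponent 134 - 127 = 7), and a normalized significand of 0x6DD000.
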


The IEEE 754 floating-point encoding scheme reserves a small set of bit patterns for special values that 
are used to handle certain processing conditions. The first group of special values includes denormalized 
numbers (or denormals). As shown in the earlier encoding example, the standard encoding of a 
floating-point number assumes that the leading digit of the significand is always a 1. One limitation of the 
IEEE 754 floating-point encoding scheme is its inability to accurately represent numbers very close to zero. 
In these cases, values get encoded using a non-normalized format, which enables tiny numbers close to zero 
(both positive and negative) to be encoded using less precision. Denormals rarely occur but when they do, 
the CPU can still process them. In algorithms where the use of a denormal is problematic, a function can 
test a floating-point value in order to ascertain its denormal state or the processor can be configured to 
generate an underflow or denormal exception. 

Another application of special values involves the encodings that are used for floating-point zero. The 
IEEE 754 standard supports two different representations of floating-point zero: positive zero (+0.0) and 
negative zero (-0.0). A negative zero can be generated either algorithmically or as a side effect of the floating- 
point rounding mode. Computationally, the processor treats positive and negative zero the same and the 
programmer typically does not need to be concerned. 

The IEEE 754 encoding scheme also supports positive and negative representations of infinity. Infinities 
are produced by certain numerical algorithms, overflow conditions, or division by zero. As discussed later in 
this chapter, the processor can be configured to generate an exception whenever a floating-point overflow 
occurs or if a program attempts to divide a number by zero. 

The final special value type is called Not a Number (NaN). NaNs are floating-point encodings that 
represent invalid numbers. The IEEE 754 standard defines two types of NaNs: signaling NaN (SNaN) and 
quiet NaN (QNaN). SNaNs are created by software; an x86-64 CPU will not create a SNaN during any 
arithmetic operation. Any attempt by an instruction to use a SNaN will cause an invalid operation exception, 
unless the exception is masked. SNaNs are useful for testing exception handlers. They can also be exploited 
by an application program for proprietary numerical-processing purposes. An x86 CPU uses QNaNs as a 
default response to certain invalid arithmetic operations whose exceptions are masked. For example, one 
unique encoding of a QNaN, called an indefinite, is substituted for a result whenever a function uses one 
of the scalar square root instructions with a negative value. QNaNs also can be used by programs to signify 
algorithm-specific errors or other unusual numerical conditions. When QNaNs are used as operands, they 
enable continued processing without generating an exception. 
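
The special values described in this section can be identified from C++ using the standard library's classification functions. The following sketch is illustrative rather than definitive; the FpKind enumeration and the classify function are hypothetical names, and the results assume default compiler settings (no fast-math options and no flush-to-zero mode).

```cpp
#include <cmath>
#include <limits>

// Categories that mirror the special values discussed in the text.
enum class FpKind { Zero, Denormal, Normal, Infinity, NaN };

// Maps the result of std::fpclassify onto the categories above.
FpKind classify(float value)
{
    switch (std::fpclassify(value)) {
        case FP_ZERO:      return FpKind::Zero;
        case FP_SUBNORMAL: return FpKind::Denormal;
        case FP_INFINITE:  return FpKind::Infinity;
        case FP_NAN:       return FpKind::NaN;
        default:           return FpKind::Normal;
    }
}

// Returns true only for negative zero; the comparison alone cannot tell
// the two zeros apart, so the sign bit must be examined as well.
bool is_negative_zero(float value)
{
    return value == 0.0f && std::signbit(value);
}
```

For example, classify(std::numeric_limits<float>::denorm_min()) reports a denormal, and is_negative_zero(-0.0f) is true even though -0.0f == 0.0f also evaluates to true. A NaN is the only value that compares unequal to itself.
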

When developing software that performs floating-point calculations, it is important to keep in mind that 
the employed encoding scheme is simply an approximation of a real-number system. It is impossible for any 
floating-point encoding system to represent an infinite number of values using a finite number of bits. This 
leads to floating-point rounding errors that can affect the accuracy of a calculation. Also, some mathematical 
properties that hold true for integers and real numbers are not necessarily true for floating-point numbers. 
For example, floating-point multiplication is not necessarily associative; (a * b) * c may not equal 
a * (b * c) depending on the values of a, b, and c. Developers of algorithms that require high levels of 
floating-point accuracy must be aware of these issues. Appendix A contains a list of references that explain 
this and other potential pitfalls of floating-point arithmetic in greater detail. Chapter 9 also includes a 
source code example that exemplifies floating-point non-associativity. 
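
A short C++ sketch makes the non-associativity of floating-point multiplication concrete. This is not the Chapter 9 example; the function names are hypothetical, and the operands were chosen so that one grouping overflows to infinity while the other does not. Smaller discrepancies caused by rounding alone are far more common in practice.

```cpp
#include <cmath>

// Computes (a * b) * c. The volatile intermediate forces the partial
// product to be rounded to double precision before the second multiply.
double mul_left_first(double a, double b, double c)
{
    volatile double ab = a * b;
    return ab * c;
}

// Computes a * (b * c) with the same rounding guarantee.
double mul_right_first(double a, double b, double c)
{
    volatile double bc = b * c;
    return a * bc;
}
```

With a = 1e200, b = 1e200, and c = 1e-200, mul_left_first overflows when it forms a * b and therefore returns infinity, while mul_right_first returns a finite value close to 1e200.
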


Scalar Floating-Point Register Set 


As previously shown in Figure 4-6, all x86-64 compatible processors include 16 128-bit registers named 
XMM0 - XMM15. A program can use any of the XMM registers to perform scalar floating-point operations 
including common arithmetic calculations, data transfers, comparisons, and type conversions. The CPU 
uses the low-order 32 bits of an XMM register to carry out single-precision floating-point calculations. 
Double-precision floating-point operations employ the low-order 64 bits. Figure 4-10 illustrates these 
register locations in greater detail. Programs cannot use the high-order bits of an XMM register to perform 
scalar floating-point calculations. However, when used as a destination operand, the values of these bits 
might be modified during the execution of an AVX scalar floating-point instruction as explained later in 
this section. 


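[Figure: XMM register bit positions 127 through 0; a single-precision value occupies bits 31:0, a 
double-precision value occupies bits 63:0, and the remaining high-order bits take no part in the calculation] 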


Figure 4-10. Scalar floating-point values when loaded in an XMM register 


Control-Status Register 


In addition to the XMM registers, x86-64 processors include a 32-bit control-status register named MXCSR. 
This register contains a series of control flags that enable a program to specify options for floating-point 
calculations and exceptions. It also includes a set of status flags that can be tested to detect floating-point 
error conditions. Figure 4-11 shows the organization of the bits in MXCSR; Table 4-5 describes the purpose 
of each bit field. 


Figure 4-11. MXCSR control and status register 

Table 4-5. Description of MXCSR Register Bit Fields 


Bit Field   Name                      Description 
IE          Invalid operation flag    Floating-point invalid operation error flag. 
DE          Denormal flag             Floating-point denormal error flag. 
ZE          Divide-by-zero flag       Floating-point division-by-zero error flag. 
OE          Overflow flag             Floating-point overflow error flag. 
UE          Underflow flag            Floating-point underflow error flag. 
PE          Precision flag            Floating-point precision error flag. 
DAZ         Denormals are zero        When set to 1, forcibly converts a denormal source operand to zero 
                                      prior to its use in a calculation. 
IM          Invalid operation mask    Floating-point invalid operation error exception mask. 
DM          Denormal mask             Floating-point denormal error exception mask. 
ZM          Divide-by-zero mask       Floating-point divide-by-zero error exception mask. 
OM          Overflow mask             Floating-point overflow error exception mask. 
UM          Underflow mask            Floating-point underflow error exception mask. 
PM          Precision mask            Floating-point precision error exception mask. 
RC          Rounding control          Specifies the method for rounding floating-point results. Valid options 
                                      include round to nearest (00b), round down toward -∞ (01b), round 
                                      up toward +∞ (10b), and round toward zero or truncate (11b). 
FTZ         Flush to zero             When set to 1, forces a zero result if the underflow exception is masked 
                                      and a floating-point underflow error occurs. 


An application program can modify any of the MXCSR's control flags or status bits to accommodate its 
specific SIMD floating-point processing requirements. Any attempt to write a non-zero value to a reserved 
bit position will cause the processor to generate an exception. The processor sets an MXCSR error flag to 
1 following the occurrence of an error condition. MXCSR error flags are not automatically cleared by the 
processor after an error is detected; they must be manually reset. The control flags and status bits of the 
MXCSR register can be modified using the vldmxcsr (Load MXCSR Register) instruction. Setting a mask bit 
to 1 disables the corresponding exception. The vstmxcsr (Store MXCSR Register) instruction can be used to 
save the current MXCSR state. An application program cannot directly access the internal processor tables 
that specify floating-point exception handlers. However, most C++ compilers provide a library function that 
allows an application program to designate a callback function that gets invoked whenever a floating-point 
exception occurs. 

The MXCSR includes two control flags that can be used to speed up certain floating-point calculations. 
Setting the MXCSR.DAZ control flag to 1 can improve the performance of algorithms where the rounding 
of a denormal value to zero is acceptable. Similarly, the MXCSR.FTZ control flag can be used to accelerate 
computations where floating-point underflows are common. The downside of enabling either of these 
options is non-compliance with the IEEE 754 floating-point standard. 
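
The effect of the rounding control field and the sticky nature of the error flags can be observed from C++ through the standard <cfenv> interface. The sketch below is a portable illustration, not a direct MXCSR access; it assumes that the C++ runtime on an x86-64 platform implements the floating-point environment on top of MXCSR, which is typical but not guaranteed by the language standard. The two function names are hypothetical.

```cpp
#include <cfenv>
#include <cmath>

// Rounds x to an integral value using the specified rounding mode
// (FE_TONEAREST, FE_DOWNWARD, FE_UPWARD, or FE_TOWARDZERO), then
// restores the rounding mode that was previously in effect.
double round_with_mode(double x, int mode)
{
    const int saved_mode = std::fegetround();
    std::fesetround(mode);

    // The volatile objects keep the compiler from evaluating the
    // rounding operation at compile time or moving it past fesetround.
    volatile double input = x;
    volatile double result = std::nearbyint(input);

    std::fesetround(saved_mode);
    return result;
}

// Clears all error flags, divides by zero, and reports whether the
// divide-by-zero flag is now set. The flag is left set on return.
bool divide_by_zero_sets_flag()
{
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double numerator = 1.0;
    volatile double denominator = 0.0;
    volatile double quotient = numerator / denominator;
    (void)quotient;
    return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

Rounding 2.5 produces 3.0 under FE_UPWARD and 2.0 under the other three modes. After divide_by_zero_sets_flag returns, the divide-by-zero flag stays set until the program clears it explicitly with std::feclearexcept, which mirrors the manual reset requirement described above.
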


Instruction Set Overview 


Table 4-6 lists in alphabetical order commonly used AVX scalar floating-point instructions. In this table, 
mnemonic text [d|s] signifies that an instruction can be used with either double-precision floating-point or 
single-precision floating-point operands. You'll learn how to use many of these instructions in Chapter 5. 

Table 4-6. Overview of Commonly-Used AVX Scalar Floating-Point Instructions 


Mnemonic            Description 
vadds[d|s]          Scalar floating-point addition 
vbroadcasts[d|s]    Broadcast scalar floating-point value 
vcmps[d|s]          Scalar floating-point compare 
vcomis[d|s]         Ordered scalar floating-point compare and set RFLAGS 
vcvts[d|s]2si       Convert scalar floating-point to doubleword signed integer 
vcvtsd2ss           Convert scalar DPFP to scalar SPFP 
vcvtsi2s[d|s]       Convert signed doubleword integer to scalar floating-point 
vcvtss2sd           Convert scalar SPFP to DPFP 
vcvtts[d|s]2si      Convert with truncation scalar floating-point to signed integer 
vdivs[d|s]          Scalar floating-point division 
vmaxs[d|s]          Scalar floating-point maximum 
vmins[d|s]          Scalar floating-point minimum 
vmovs[d|s]          Move scalar floating-point value 
vmuls[d|s]          Scalar floating-point multiplication 
vrounds[d|s]        Round scalar floating-point value 
vsqrts[d|s]         Scalar floating-point square root 
vsubs[d|s]          Scalar floating-point subtraction 
vucomis[d|s]        Unordered scalar floating-point compare and set RFLAGS 


Table 4-7 illustrates the operation of the AVX scalar floating-point instructions vadds[d|s] and 
vsqrts[d|s]. In these examples, the colon notation designates bit position ranges within a register (e.g., 
31:0 designates bit positions 31 through 0 inclusive). Note that execution of an AVX scalar floating-point 
instruction also copies the unused bits of the first source operand to the destination operand. Also note that 
the upper 128 bits of the corresponding YMM register are set to zero. 


Table 4-7. AVX Scalar Floating-Point Instruction Examples 


Instruction                 Operation 
vaddss xmm0,xmm1,xmm2       xmm0[31:0] = xmm1[31:0] + xmm2[31:0] 
                            xmm0[127:32] = xmm1[127:32] 
                            ymm0[255:128] = 0 
vaddsd xmm0,xmm1,xmm2       xmm0[63:0] = xmm1[63:0] + xmm2[63:0] 
                            xmm0[127:64] = xmm1[127:64] 
                            ymm0[255:128] = 0 
vsqrtss xmm0,xmm1,xmm2      xmm0[31:0] = sqrt(xmm2[31:0]) 
                            xmm0[127:32] = xmm1[127:32] 
                            ymm0[255:128] = 0 
vsqrtsd xmm0,xmm1,xmm2      xmm0[63:0] = sqrt(xmm2[63:0]) 
                            xmm0[127:64] = xmm1[127:64] 
                            ymm0[255:128] = 0 
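
The destination-register rules in Table 4-7 can be expressed as a small C++ model. This is only a sketch of the documented behavior, not executable AVX code; the Ymm type, which treats a 256-bit register as eight single-precision lanes, and the two model functions are hypothetical. Lanes 0 through 3 stand for the aliased XMM register.

```cpp
#include <array>
#include <cmath>

// Hypothetical model of a YMM register: eight 32-bit lanes, lane 0 = bits 31:0.
using Ymm = std::array<float, 8>;

// Models vaddss dst,src1,src2.
Ymm model_vaddss(const Ymm& src1, const Ymm& src2)
{
    Ymm dst{};                     // every lane starts at zero (bits 255:128 stay zero)
    dst[0] = src1[0] + src2[0];    // bits 31:0 receive the scalar sum
    dst[1] = src1[1];              // bits 127:32 are copied from the first source
    dst[2] = src1[2];
    dst[3] = src1[3];
    return dst;
}

// Models vsqrtss dst,src1,src2.
Ymm model_vsqrtss(const Ymm& src1, const Ymm& src2)
{
    Ymm dst{};
    dst[0] = std::sqrt(src2[0]);   // bits 31:0 receive the scalar square root
    dst[1] = src1[1];              // bits 127:32 are copied from the first source
    dst[2] = src1[2];
    dst[3] = src1[3];
    return dst;
}
```

Note that in both models the second source operand contributes only its low-order element; its remaining elements are ignored.
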


AVX Packed Floating-Point 


AVX supports packed floating-point operations using either 128-bit wide or 256-bit wide operands. 
Figures 4-12 and 4-13 illustrate common packed floating-point arithmetic operations using 256-bit wide 
operands with single-precision and double-precision elements. Similar to AVX scalar floating-point, 
rounding for AVX packed floating-point arithmetic operations is specified by the MXCSR’s rounding 
control field, as defined in Table 4-5. The processor also uses the MXCSR’s status flags to signal the 
occurrence of a packed floating-point error condition. 


vaddps ymm2,ymm0,ymm1 ;packed single-precision addition 
vmulps ymm2,ymm0,ymm1 ;packed single-precision multiplication 


Figure 4-12. AVX packed single-precision floating-point addition and multiplication 


vsubpd ymm2,ymm0,ymm1 ;packed double-precision subtraction 
vdivpd ymm2,ymm0,ymm1 ;packed double-precision division 


Figure 4-13. AVX packed double-precision floating-point subtraction and division 


Most AVX arithmetic instructions perform their operations using the corresponding element positions 
of the two source operands. AVX also supports horizontal arithmetic operations using either packed 
floating-point or packed integer operands. A horizontal arithmetic operation carries out its computations 
using the adjacent elements of a packed data type. Figure 4-14 illustrates horizontal addition using single- 
precision floating-point and horizontal subtraction using double-precision floating-point operands. The 
AVX instruction set also supports integer horizontal addition and subtraction using packed words and 
doublewords. Horizontal operations are typically used to reduce a packed data operand that contains 
multiple intermediate values to a single final result. 


vhaddps xmm3,xmm1,xmm2 ;horizontal packed single-precision addition 
vhsubpd xmm3,xmm1,xmm2 ;horizontal packed double-precision subtraction 


Figure 4-14. AVX horizontal addition and subtraction using single-precision and double-precision elements 
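
The element pairing used by the two horizontal instructions in Figure 4-14 can be modeled in C++ as shown below. This sketch is illustrative; the vector types and function names are hypothetical, and element 0 denotes the low-order element of each register. The horizontal_sum helper demonstrates the reduction use case mentioned above by applying the horizontal addition twice.

```cpp
#include <array>

using Vec4f = std::array<float, 4>;    // four SPFP elements of an XMM register
using Vec2d = std::array<double, 2>;   // two DPFP elements of an XMM register

// Models vhaddps dst,src1,src2: adjacent pairs of src1 fill the low half
// of the result and adjacent pairs of src2 fill the high half.
Vec4f model_vhaddps(const Vec4f& src1, const Vec4f& src2)
{
    Vec4f dst = {{ src1[0] + src1[1], src1[2] + src1[3],
                   src2[0] + src2[1], src2[2] + src2[3] }};
    return dst;
}

// Models vhsubpd dst,src1,src2: low element minus high element of each source.
Vec2d model_vhsubpd(const Vec2d& src1, const Vec2d& src2)
{
    Vec2d dst = {{ src1[0] - src1[1], src2[0] - src2[1] }};
    return dst;
}

// Reduces four partial sums to a single value using two horizontal additions.
float horizontal_sum(const Vec4f& v)
{
    Vec4f t = model_vhaddps(v, v);   // {v0+v1, v2+v3, v0+v1, v2+v3}
    t = model_vhaddps(t, t);         // every element holds v0+v1+v2+v3
    return t[0];
}
```
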


Instruction Set Overview 


Table 4-8 lists in alphabetical order commonly used AVX packed floating-point instructions. Similar to 
the scalar floating-point table that you saw in the previous section, the mnemonic text [d|s] signifies that 
an instruction can be used with either packed double-precision floating-point or packed single-precision 
floating-point operands. You'll learn how to use many of these instructions in Chapter 6. 

Table 4-8. Overview of Commonly-Used AVX Packed Floating-Point Instructions 


Instruction         Description 
vaddp[d|s]          Packed floating-point addition 
vaddsubp[d|s]       Packed floating-point add-subtract 
vandp[d|s]          Packed floating-point bitwise AND 
vandnp[d|s]         Packed floating-point bitwise AND NOT 
vblendp[d|s]        Packed floating-point blend 
vblendvp[d|s]       Variable packed floating-point blend 
vcmpp[d|s]          Packed floating-point compare 
vcvtdq2p[d|s]       Convert packed signed doubleword integers to floating-point 
vcvtp[d|s]2dq       Convert packed floating-point to signed doublewords 
vcvtpd2ps           Convert packed DPFP to packed SPFP 
vcvtps2pd           Convert packed SPFP to packed DPFP 
vdivp[d|s]          Packed floating-point division 
vdpp[d|s]           Packed dot product 
vhaddp[d|s]         Horizontal packed floating-point addition 
vhsubp[d|s]         Horizontal packed floating-point subtraction 
vmaskmovp[d|s]      Packed floating-point conditional load and store 
vmaxp[d|s]          Packed floating-point maximum 
vminp[d|s]          Packed floating-point minimum 
vmovap[d|s]         Move aligned packed floating-point values 
vmovmskp[d|s]       Extract packed floating-point sign bitmask 
vmovup[d|s]         Move unaligned packed floating-point values 
vmulp[d|s]          Packed floating-point multiplication 
vorp[d|s]           Packed floating-point bitwise inclusive OR 
vpermilp[d|s]       Permute in-lane packed floating-point elements 
vroundp[d|s]        Round packed floating-point values 
vshufp[d|s]         Shuffle packed floating-point values 
vsqrtp[d|s]         Packed floating-point square root 
vsubp[d|s]          Packed floating-point subtraction 
vunpckhp[d|s]       Unpack and interleave high packed floating-point values 
vunpcklp[d|s]       Unpack and interleave low packed floating-point values 
vxorp[d|s]          Packed floating-point bitwise exclusive OR 


AVX Packed Integer 


AVX supports packed integer operations using 128-bit wide operands. A 128-bit wide operand facilitates 
packed integer operations using two quadwords, four doublewords, eight words, or sixteen bytes, as shown 
in Figure 4-15. In this figure, the vpaddb (Add Packed Integers) instruction illustrates packed 8-bit integer 
addition. The vpmaxsw (Packed Signed Integer Maximums) instruction saves the maximum signed word 
value of each element pair to the specified destination operand. The vpmulld (Multiply Packed Integers and 
Store Low Result) instruction carries out packed signed doubleword multiplication and saves the low-order 
32 bits of each result. Finally, the vpsllq (Shift Packed Data Left Logical) instruction performs a logical left 
shift of each quadword element using the bit count that's specified by the immediate operand. 


vpaddb xmm2,xmm0,xmm1 ;packed unsigned byte addition 
vpmaxsw xmm2,xmm0,xmm1 ;packed signed word maximums 
vpmulld xmm2,xmm0,xmm1 ;packed signed doubleword multiplication 
vpsllq xmm2,xmm1,2 ;packed quadword logical left shift 


Figure 4-15. Example AVX packed integer operations 
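
Three of the operations in Figure 4-15 can be modeled element by element in C++. The following sketch is illustrative only; the function names are hypothetical, each array stands for the elements of a 128-bit XMM operand, and element 0 is the low-order element.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Models vpmaxsw: signed word maximum of each element pair.
std::array<std::int16_t, 8> model_vpmaxsw(const std::array<std::int16_t, 8>& a,
                                          const std::array<std::int16_t, 8>& b)
{
    std::array<std::int16_t, 8> r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = std::max(a[i], b[i]);
    return r;
}

// Models vpmulld: signed doubleword multiply, keeping the low-order 32 bits.
std::array<std::int32_t, 4> model_vpmulld(const std::array<std::int32_t, 4>& a,
                                          const std::array<std::int32_t, 4>& b)
{
    std::array<std::int32_t, 4> r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        const std::int64_t product = static_cast<std::int64_t>(a[i]) * b[i];
        // Truncate the 64-bit product to its low-order 32 bits.
        r[i] = static_cast<std::int32_t>(static_cast<std::uint32_t>(product));
    }
    return r;
}

// Models vpsllq with an immediate bit count: logical left shift of each
// quadword. A bit count greater than 63 produces zero in every element.
std::array<std::uint64_t, 2> model_vpsllq(const std::array<std::uint64_t, 2>& a,
                                          unsigned count)
{
    std::array<std::uint64_t, 2> r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = (count > 63) ? 0 : (a[i] << count);
    return r;
}
```

For instance, multiplying the doubleword 100000 by itself gives 10,000,000,000, which does not fit in 32 bits; model_vpmulld keeps the low-order 32 bits, 1410065408.
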

Most AVX packed integer instructions do not update the status flags in the RFLAGS register. This means 
that error conditions such as arithmetic overflow and underflow are not reported. It also means that the 
results of a packed integer operation do not directly affect execution of the conditional instructions cmovcc, 
jcc, and setcc. However, programs can employ SIMD-specific techniques to make logical decisions based on 
the outcome of a packed integer operation. You'll see examples of these techniques in Chapter 7. 
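
One widely used SIMD-specific technique is to let a packed compare produce a per-element mask and then combine operands with bitwise operations instead of branching. The C++ sketch below models this idea for eight signed words; it is an assumption-labeled illustration rather than the approach used in Chapter 7, and the type and function names are hypothetical. The compare model follows vpcmpgtw, and the select step uses only the AND, AND NOT, and OR operations that vpand, vpandn, and vpor provide.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Words = std::array<std::int16_t, 8>;   // eight signed words of an XMM operand

// Models vpcmpgtw: each result element is all ones (-1) where a > b, else zero.
Words model_vpcmpgtw(const Words& a, const Words& b)
{
    Words mask{};
    for (std::size_t i = 0; i < mask.size(); ++i)
        mask[i] = (a[i] > b[i]) ? static_cast<std::int16_t>(-1)
                                : static_cast<std::int16_t>(0);
    return mask;
}

// Selects x where the mask element is all ones and y where it is zero,
// using bitwise operations only (no per-element branch).
Words select_by_mask(const Words& mask, const Words& x, const Words& y)
{
    Words r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = static_cast<std::int16_t>((x[i] & mask[i]) | (y[i] & ~mask[i]));
    return r;
}

// Replaces every negative element with zero without testing any flags.
Words clamp_negative_to_zero(const Words& v)
{
    const Words zero{};
    return select_by_mask(model_vpcmpgtw(v, zero), v, zero);
}
```

Because the mask carries the outcome of the comparison in the data itself, no RFLAGS status flag or conditional jump is required to process all eight elements.
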


Instruction Set Overview 


Table 4-9 lists in alphabetical order commonly-used AVX packed integer instructions. In this table, the 
mnemonic text [b|w|d|q] signifies the size (byte, word, doubleword, or quadword) of the elements that are 
processed. You'll learn how to use many of these instructions in Chapter 7. 


Table 4-9. Overview of Commonly-Used AVX Packed Integer Instructions 


Instruction             Description

vmov[d|q]               Move to/from XMM register
vmovdqa                 Move aligned packed integer values
vmovdqu                 Move unaligned packed integer values
vpabs[b|w|d]            Packed integer absolute value
vpackss[dw|wb]          Pack with signed saturation
vpackus[dw|wb]          Pack with unsigned saturation
vpadd[b|w|d|q]          Packed integer addition
vpadds[b|w]             Packed integer addition with signed saturation
vpaddus[b|w]            Packed integer addition with unsigned saturation
vpand                   Packed bitwise AND
vpandn                  Packed bitwise AND NOT
vpcmpeq[b|w|d|q]        Packed integer compare for equality
vpcmpgt[b|w|d|q]        Packed signed integer compare for greater than
vpextr[b|w|d|q]         Extract integer from XMM register
vphadd[w|d]             Horizontal packed addition
vphsub[w|d]             Horizontal packed subtraction
vpinsr[b|w|d|q]         Insert integer into XMM register
vpmaxs[b|w|d]           Packed signed integer maximum
vpmaxu[b|w|d]           Packed unsigned integer maximum
vpmins[b|w|d]           Packed signed integer minimum
vpminu[b|w|d]           Packed unsigned integer minimum
vpmovsx                 Packed integer move with sign extend
vpmovzx                 Packed integer move with zero extend
vpmuldq                 Packed signed doubleword multiplication
vpmulhuw                Packed unsigned word multiplication, save high result
vpmul[h|l]w             Packed signed word multiplication, save [high | low] result
vpmull[d|w]             Packed signed multiplication (save low result)
vpmuludq                Packed unsigned doubleword multiplication
vpshuf[b|d]             Shuffle packed integers
vpshuf[h|l]w            Shuffle [high | low] packed words
vpslldq                 Shift logical left double quadword
vpsll[w|d|q]            Packed logical shift left
vpsra[w|d]              Packed arithmetic shift right
vpsrldq                 Shift logical right double quadword
vpsrl[w|d|q]            Packed logical shift right
vpsub[b|w|d|q]          Packed integer subtraction
vpsubs[b|w]             Packed integer subtraction with signed saturation
vpsubus[b|w]            Packed integer subtraction with unsigned saturation
vpunpckh[bw|wd|dq]      Unpack high data
vpunpckl[bw|wd|dq]      Unpack low data


Differences Between x86-AVX and x86-SSE 


If you have any previous experience with x86-SSE assembly language programming, you have undoubtedly 
noticed that a high degree of symmetry exists between this execution environment and x86-AVX. Most x86- 
SSE instructions have an x86-AVX equivalent that can use either 256-bit or 128-bit wide operands. There 
are, however, a few important differences between the x86-SSE and x86-AVX execution environments. The 
remainder of this section explains these differences. Even if you don’t have any previous experience with 
x86-SSE, I still recommend reading this section since it elucidates important details that you need to be 
aware of when writing code that uses the x86-AVX instruction set. 

Within an x86-64 processor that supports x86-AVX, each 256-bit YMM register is partitioned into an 
upper and lower 128-bit lane. Many x86-AVX instructions carry out their operations using same-lane source 
and destination operand elements. This independent lane execution tends to be inconspicuous when using 
x86-AVX instructions that perform arithmetic calculations. However, when using instructions that re-order 
the data elements of a packed quantity, the effect of separate execution lanes is more evident. For example, 
the vshufps (Packed Interleave Shuffle of Single-Precision Values) instruction rearranges the elements of 
its source operands according to a control mask that’s specified as an immediate operand. The vpunpcklwd 
(Unpack Low Data) instruction interleaves the low-order elements in its two source operands. Figure 4-16 
illustrates the in-lane effect of these instructions in greater detail. Note that the floating-point shuffle and 
unpack operations are carried out independently in both the upper (bits 255:128) and lower (bits 127:0) 
double quadwords. You'll learn more about the vshufps and vpunpcklwd instructions in Chapters 6 and 7.


vshufps ymm0,ymm1,ymm2,01110010b    ;shuffle single-precision elements

vpunpcklwd ymm0,ymm1,ymm2           ;unpack low words to doublewords


Figure 4-16. Examples of x86-AVX instruction execution using independent lanes
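
To make the in-lane behavior concrete, here is a small C++ sketch that models the 256-bit form of vpunpcklwd on arrays of sixteen words. It is an illustration only; the type alias Words256 and the function name UnpackLowWords256 are assumptions made for this example. Element 0 of each array corresponds to bits 15:0 of the register.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Words256 = std::array<uint16_t, 16>;   // 16 words; element 0 = bits 15:0

// Models vpunpcklwd ymm_des,ymm_src1,ymm_src2. Within each 128-bit lane
// (8 words), the four low-order words of src1 and src2 are interleaved.
// The upper lane never reads from or writes to the lower lane, and vice versa.
Words256 UnpackLowWords256(const Words256& src1, const Words256& src2)
{
    Words256 des{};

    for (std::size_t lane = 0; lane < 2; lane++)
    {
        const std::size_t base = lane * 8;       // first word of this lane

        for (std::size_t i = 0; i < 4; i++)
        {
            des[base + 2 * i]     = src1[base + i];
            des[base + 2 * i + 1] = src2[base + i];
        }
    }

    return des;
}
```

With src1 holding the values 0 through 15 and src2 holding 100 through 115, the lower lane of the result is 0, 100, 1, 101, 2, 102, 3, 103 and the upper lane is 8, 108, 9, 109, 10, 110, 11, 111. Words 4 through 7 of each source lane are not used, which is the lane effect that Figure 4-16 depicts.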


The aliasing of the XMM and YMM register sets introduces a few programming issues that software 
developers need to keep in mind. The first issue relates to the processor's handling of a YMM register's 
high-order 128 bits when the corresponding XMM register is used as a destination operand. When executing 
on a processor that supports x86-AVX technology, an x86-SSE instruction that uses an XMM register as a 
destination operand will never modify the upper 128 bits of the corresponding YMM register. However, the 
equivalent x86-AVX instruction will zero the upper 128 bits of the respective YMM register. Consider, for 
example, the following instances of the (v)cvtps2pd (Convert Packed Single-Precision to Packed Double- 
Precision) instruction: 


cvtps2pd xmm0,xmm1
vcvtps2pd xmm0,xmm1
vcvtps2pd ymm0,xmm1


The x86-SSE cvtps2pd instruction converts the two packed single-precision floating-point values in the
low-order quadword of XMM1 to double-precision floating-point and saves the result in register XMM0.
This instruction does not modify the high-order 128 bits of register YMM0. The first vcvtps2pd instruction
performs the same packed single-precision to packed double-precision conversion operation; it also zeros
the high-order 128 bits of YMM0. The second vcvtps2pd instruction converts the four packed single-precision
floating-point values in the low-order 128 bits of YMM1 (i.e., register XMM1) to packed double-precision
floating-point values and saves the result to YMM0.
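
The difference in how the upper 128 bits are handled can be pictured with a toy C++ model. This is a sketch under simplifying assumptions: a YMM register is represented as four 64-bit quadwords, and the names Ymm, WriteXmmLegacySse, and WriteXmmVex are hypothetical. It models only the register-aliasing rule, not any real instruction.

```cpp
#include <array>
#include <cstdint>

// Toy model of one YMM register as four quadwords. Elements 0-1 alias the
// XMM register (bits 127:0); elements 2-3 hold bits 255:128.
using Ymm = std::array<uint64_t, 4>;

// x86-SSE style write to the XMM register: bits 255:128 are left unchanged.
void WriteXmmLegacySse(Ymm& reg, uint64_t lo, uint64_t hi)
{
    reg[0] = lo;
    reg[1] = hi;
}

// x86-AVX style 128-bit write to the XMM register: bits 255:128 are zeroed.
void WriteXmmVex(Ymm& reg, uint64_t lo, uint64_t hi)
{
    reg[0] = lo;
    reg[1] = hi;
    reg[2] = 0;
    reg[3] = 0;
}
```

Starting from a register that holds {1, 2, 3, 4}, WriteXmmLegacySse leaves elements 2 and 3 as 3 and 4, while WriteXmmVex clears them to zero.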

X86-AVX relaxes the alignment requirements of x86-SSE for packed operands in memory. Except for 
instructions that explicitly specify an aligned operand (e.g., vmovaps, vmovdqa, etc.), proper alignment of a 
128-bit or 256-bit wide operand in memory is not mandatory. However, 128-bit and 256-bit wide operands 
should always be properly aligned whenever possible in order to prevent processing delays that can occur 
when the processor accesses unaligned operands in memory. 

The last issue that programmers need to be aware of involves the intermixing of x86-AVX and x86-SSE
code. Programs are allowed to intermix x86-AVX and x86-SSE instructions, but any intermixing should be
kept to a minimum in order to avoid internal processor state transition penalties that can affect performance.
These penalties can occur if the processor is required to preserve the upper 128 bits of each YMM register
during a transition from executing x86-AVX to executing x86-SSE instructions. State transition penalties
can be completely avoided by using the vzeroupper (Zero Upper Bits of YMM Registers) instruction, which
zeroes the upper 128 bits of all YMM registers. This instruction should be used prior to any transition from
256-bit x86-AVX code (i.e., any x86-AVX code that uses a YMM register) to x86-SSE code.

One common use of the vzeroupper instruction is by a public function that uses 256-bit x86-AVX 
instructions. These types of functions should include a vzeroupper instruction prior to the execution of 
any ret instruction since this prevents processor state transition penalties from occurring in any high-level 
language code that uses x86-SSE instructions. The vzeroupper instruction should also be employed before 
calling any library functions that might contain x86-SSE code. Later in this book, you'll see several source 
code examples that demonstrate proper use of the vzeroupper instruction. Functions can also use the 
vzeroall (Zero All YMM Registers) instruction instead of vzeroupper to avoid potential x86-AVX/x86-SSE 
state transition penalties. 


Summary 


Here are the key learning points for Chapter 4: 


• AVX technology is an x86 platform architectural enhancement that facilitates SIMD
  operations using 128-bit and 256-bit wide packed floating-point operands, both
  single-precision and double-precision.

• AVX also supports SIMD operations using 128-bit wide packed integer and scalar
  floating-point operands. AVX2 extends the AVX instruction set to support SIMD
  operations using 256-bit wide packed integer operands.

• AVX adds 16 YMM (256-bit) and XMM (128-bit) registers to the x86-64 platform.
  Each XMM register is aliased with the low-order 128 bits of its corresponding YMM
  register.

• Most AVX instructions use a three-operand syntax that includes two non-destructive
  source operands.

• AVX floating-point operations conform to the IEEE 754 standard for floating-point
  arithmetic.

• Programs can use the control and status flags in the MXCSR register to enable
  floating-point exceptions, detect floating-point error conditions, and configure
  floating-point rounding.

• Except for instructions that explicitly specify aligned operands, 128-bit and 256-bit
  wide operands in memory need not be properly aligned. However, SIMD operands
  in memory should always be properly aligned whenever possible to avoid delays that
  can occur when the processor accesses an unaligned operand in memory.

• A vzeroupper or vzeroall instruction should be used in any function that uses a
  YMM register as an operand in order to avoid potential x86-AVX to x86-SSE state
  transition performance penalties.


AVX Programming - Scalar Floating-Point


In the previous chapter, you learned about the architecture and computing capabilities of AVX. In this 
chapter, you'll learn how to use the AVX instruction set to perform scalar floating-point calculations. The 
first section includes a couple of sample programs that illustrate basic scalar floating-point arithmetic 
including addition, subtraction, multiplication, and division. The next section contains code that explains 
use of the scalar floating-point compare and conversion instructions. This is followed by two examples that
demonstrate scalar floating-point operations using arrays and matrices. The final section of this chapter 
formally describes the Visual C++ calling convention. 

All of the sample code in this chapter requires a processor and operating system that support AVX. You
can use one of the freely-available tools listed in Appendix A to determine whether or not your computer
fulfills this requirement. In Chapter 16, you learn how to programmatically detect the presence of AVX and 
other x86 processor feature extensions. 


Note Developing software that employs floating-point arithmetic always entails a few caveats. The purpose 
of the sample code presented in this and subsequent chapters is to illustrate the use of various x86 floating- 
point instructions. The sample code does not address important floating-point concerns such as rounding 
errors, numerical stability, or ill-conditioned functions. Software developers must always be cognizant of these 
issues during the design and implementation of any algorithm that employs floating-point arithmetic. If you're 
interested in learning more about the potential pitfalls of floating-point arithmetic, you should consult the 
references listed in Appendix A. 


Scalar Floating-Point Arithmetic 


The scalar floating-point capabilities of AVX provide programmers with a modern alternative to the
floating-point resources of SSE2 and the legacy x87 floating-point unit. The ability to exploit addressable 
registers means that performing elementary scalar floating-point operations such as addition, subtraction, 
multiplication, and division is similar to performing integer arithmetic using the general-purpose registers. 
In this section you learn how to code functions that perform basic floating-point arithmetic using the AVX 
instruction set. The source code examples demonstrate how to perform fundamental operations using both 
single-precision and double-precision values. You also learn about floating-point argument passing, return 
values, and MASM directives. 


Single-Precision Floating-Point


Listing 5-1 (example Ch05_01) shows the C++ and assembly language source code for a simple program that
performs Fahrenheit to Celsius temperature conversions using single-precision floating-point arithmetic.
The C++ code begins with a declaration for the assembly language function ConvertFtoC_. Note that this
function requires one argument of type float and returns a value of type float. A similar declaration is
also used for the assembly language function ConvertCtoF_. The remaining C++ code exercises the two
temperature conversion functions using several test values and displays the results.


Listing 5-1. Example Ch05_01

#include "stdafx.h"
#include <iostream>
#include <iomanip>

using namespace std;

extern "C" float ConvertFtoC_(float deg_f);
extern "C" float ConvertCtoF_(float deg_c);

int main()
{
    const int w = 10;
    float deg_fvals[] = {-459.67f, -40.0f, 0.0f, 32.0f, 72.0f, 98.6f, 212.0f};
    size_t nf = sizeof(deg_fvals) / sizeof(float);

    cout << setprecision(6);

    cout << "\n-------- ConvertFtoC Results --------\n";

    for (size_t i = 0; i < nf; i++)
    {
        float deg_c = ConvertFtoC_(deg_fvals[i]);
        cout << "  i: " << i << "  ";
        cout << "f: " << setw(w) << deg_fvals[i] << "  ";
        cout << "c: " << setw(w) << deg_c << '\n';
    }

    cout << "\n-------- ConvertCtoF Results --------\n";

    float deg_cvals[] = {-273.15f, -40.0f, -17.777778f, 0.0f, 25.0f, 37.0f, 100.0f};
    size_t nc = sizeof(deg_cvals) / sizeof(float);

    for (size_t i = 0; i < nc; i++)
    {
        float deg_f = ConvertCtoF_(deg_cvals[i]);
        cout << "  i: " << i << "  ";
        cout << "c: " << setw(w) << deg_cvals[i] << "  ";
        cout << "f: " << setw(w) << deg_f << '\n';
    }

    return 0;
}

;-------------------------------------------------
;               Ch05_01.asm
;-------------------------------------------------

            .const
r4_ScaleFtoC    real4 0.55555556            ;5 / 9
r4_ScaleCtoF    real4 1.8                   ;9 / 5
r4_32p0         real4 32.0

; extern "C" float ConvertFtoC_(float deg_f)
;
; Returns: xmm0[31:0] = temperature in Celsius.

            .code
ConvertFtoC_ proc
            vmovss xmm1,[r4_32p0]           ;xmm1 = 32
            vsubss xmm2,xmm0,xmm1           ;xmm2 = f - 32
            vmovss xmm1,[r4_ScaleFtoC]      ;xmm1 = 5 / 9
            vmulss xmm0,xmm2,xmm1           ;xmm0 = (f - 32) * 5 / 9
            ret
ConvertFtoC_ endp

; extern "C" float ConvertCtoF_(float deg_c)
;
; Returns: xmm0[31:0] = temperature in Fahrenheit.

ConvertCtoF_ proc
            vmulss xmm0,xmm0,[r4_ScaleCtoF] ;xmm0 = c * 9 / 5
            vaddss xmm0,xmm0,[r4_32p0]      ;xmm0 = c * 9 / 5 + 32
            ret
ConvertCtoF_ endp
            end


The assembly language code starts with a .const section that defines the constants needed to convert
a temperature value from Fahrenheit to Celsius and vice versa. The text real4 is a MASM directive that
allocates storage space for a single-precision floating-point value. Following the .const section is the code
for function ConvertFtoC_. The first instruction of this function, vmovss xmm1,[r4_32p0], loads the
single-precision floating-point value 32.0 from memory into register XMM1 (or more precisely into XMM1[31:0]).
A memory operand is used here since, unlike the general-purpose registers, floating-point values cannot be
used as immediate operands.

Per the Visual C++ calling convention, the first four floating-point argument values are passed to
a function using registers XMM0, XMM1, XMM2, and XMM3. This means that upon entry to function
ConvertFtoC_, register XMM0 contains the argument value deg_f. Following execution of the vmovss
instruction, the vsubss xmm2,xmm0,xmm1 instruction calculates deg_f - 32.0 and saves the result to XMM2.
Execution of the vsubss instruction does not modify the contents of the source operands XMM0 and
XMM1. This instruction also copies bits XMM0[127:32] to XMM2[127:32]. The ensuing vmovss
xmm1,[r4_ScaleFtoC] instruction loads the constant value 0.55555556 (or 5 / 9) into register XMM1. This is
followed by a vmulss xmm0,xmm2,xmm1 instruction that computes (deg_f - 32.0) * 0.55555556 and saves the
multiplicative result (i.e., the converted temperature in Celsius) in XMM0. The Visual C++ calling convention
designates register XMM0 for floating-point return values. Since the return value is already in XMM0, no
additional vmovss instructions are necessary.

The assembly language function ConvertCtoF_ follows next. The code for this function differs slightly
from ConvertFtoC_ in that the floating-point arithmetic instructions use memory operands to reference
the required conversion constants. At entry to ConvertCtoF_, register XMM0 contains argument value
deg_c. The instruction vmulss xmm0,xmm0,[r4_ScaleCtoF] calculates deg_c * 1.8. This is followed by
a vaddss xmm0,xmm0,[r4_32p0] instruction that calculates deg_c * 1.8 + 32.0. At this point it would
be scientifically remiss for me not to mention that neither ConvertFtoC_ nor ConvertCtoF_ performs any
validity checks for argument values that are physically impossible (e.g., -1000 degrees Fahrenheit). Such
checks require floating-point compare instructions and you'll learn how to use these instructions later in this
chapter. Here are the results for source code example Ch05_01.


-------- ConvertFtoC Results --------
  i: 0  f:    -459.67  c:    -273.15
  i: 1  f:        -40  c:        -40
  i: 2  f:          0  c:   -17.7778
  i: 3  f:         32  c:          0
  i: 4  f:         72  c:    22.2222
  i: 5  f:       98.6  c:         37
  i: 6  f:        212  c:        100

-------- ConvertCtoF Results --------
  i: 0  c:    -273.15  f:    -459.67
  i: 1  c:        -40  f:        -40
  i: 2  c:   -17.7778  f:          0
  i: 3  c:          0  f:         32
  i: 4  c:         25  f:         77
  i: 5  c:         37  f:       98.6
  i: 6  c:        100  f:        212


Double-Precision Floating-Point 


The source code examples presented in this section illustrate simple floating-point arithmetic using
double-precision values. Listing 5-2 shows the source code for example Ch05_02. In this example, the assembly
language function CalcSphereAreaVolume_ calculates the surface area and volume of a sphere using the
supplied radius value.

Listing 5-2. Example Ch05_02

#include "stdafx.h"
#include <iostream>
#include <iomanip>

using namespace std;

extern "C" void CalcSphereAreaVolume_(double r, double* sa, double* vol);

int _tmain(int argc, _TCHAR* argv[])
{
    double r[] = { 0.0, 1.0, 2.0, 3.0, 5.0, 10.0, 20.0, 32.0 };
    size_t num_r = sizeof(r) / sizeof(double);

    cout << setprecision(8);
    cout << "\n--------- Results for CalcSphereAreaVol -----------\n";

    for (size_t i = 0; i < num_r; i++)
    {
        double sa = -1, vol = -1;

        CalcSphereAreaVolume_(r[i], &sa, &vol);

        cout << "i: " << i << "  ";
        cout << "r: " << setw(6) << r[i] << "  ";
        cout << "sa: " << setw(11) << sa << "  ";
        cout << "vol: " << setw(11) << vol << '\n';
    }

    return 0;
}

;-------------------------------------------------
;               Ch05_02.asm
;-------------------------------------------------

            .const
r8_PI       real8 3.14159265358979323846
r8_4p0      real8 4.0
r8_3p0      real8 3.0

; extern "C" void CalcSphereAreaVolume_(double r, double* sa, double* vol);

            .code
CalcSphereAreaVolume_ proc

; Calculate surface area = 4 * PI * r * r
            vmulsd xmm1,xmm0,xmm0           ;xmm1 = r * r
            vmulsd xmm2,xmm1,[r8_PI]        ;xmm2 = r * r * PI
            vmulsd xmm3,xmm2,[r8_4p0]       ;xmm3 = r * r * PI * 4

; Calculate volume = sa * r / 3
            vmulsd xmm4,xmm3,xmm0           ;xmm4 = r * r * r * PI * 4
            vdivsd xmm5,xmm4,[r8_3p0]       ;xmm5 = r * r * r * PI * 4 / 3

; Save results
            vmovsd real8 ptr [rdx],xmm3     ;save surface area
            vmovsd real8 ptr [r8],xmm5      ;save volume
            ret
CalcSphereAreaVolume_ endp
            end


The declaration of function CalcSphereAreaVolume_ includes an argument value of type double for
the radius and two double* pointers to return the computed surface area and volume. The surface area and
volume of a sphere can be calculated using the following formulas:


$$sa = 4\pi r^2$$


$$v = \frac{4\pi r^3}{3} = \frac{(sa)\,r}{3}$$


Similar to the previous example, the assembly language code begins with a .const section that defines
several constants. The text real8 is a MASM directive that defines storage space for a double-precision
floating-point value. At entry to CalcSphereAreaVolume_, XMM0 contains the sphere radius. The vmulsd
xmm1,xmm0,xmm0 instruction squares the radius and saves the result to XMM1. Execution of this instruction
also copies the upper 64 bits of XMM0 to the same positions in XMM1 (i.e., XMM0[127:64] is copied to
XMM1[127:64]). The ensuing vmulsd xmm2,xmm1,[r8_PI] and vmulsd xmm3,xmm2,[r8_4p0] instructions
calculate r * r * PI * 4, which yields the surface area of the sphere.

The next two instructions, vmulsd xmm4,xmm3,xmm0 and vdivsd xmm5,xmm4,[r8_3p0], calculate the
sphere volume. The vmovsd real8 ptr [rdx],xmm3 and vmovsd real8 ptr [r8],xmm5 instructions save
the calculated surface area and volume values to the specified buffers. Note that the pointer arguments sa
and vol were passed to CalcSphereAreaVolume_ in registers RDX and R8. When a function uses a mixture of
integer (or pointer) and floating-point arguments, the position of the argument in the function declaration
determines which general-purpose or XMM registers get used. You'll learn more about this aspect of the
Visual C++ calling convention later in this chapter. Here is the output for example Ch05_02.


--------- Results for CalcSphereAreaVol -----------
i: 0  r:      0  sa:           0  vol:           0
i: 1  r:      1  sa:   12.566371  vol:   4.1887902
i: 2  r:      2  sa:   50.265482  vol:   33.510322
i: 3  r:      3  sa:   113.09734  vol:   113.09734
i: 4  r:      5  sa:   314.15927  vol:   523.59878
i: 5  r:     10  sa:   1256.6371  vol:   4188.7902
i: 6  r:     20  sa:   5026.5482  vol:   33510.322
i: 7  r:     32  sa:   12867.964  vol:   137258.28

Listing 5-3 (example Ch05_03) contains the code for the next source code example, which also illustrates
how to carry out calculations using double-precision floating-point arithmetic. In this example, the assembly
language function CalcDistance_ calculates the Euclidean distance between two points in 3D space using
the following equation:


$$dist = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2 + (z_2 - z_1)^2}$$


Listing 5-3. Example Ch05_03

#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>
#include <cmath>

using namespace std;

extern "C" double CalcDistance_(double x1, double y1, double z1, double x2, double y2, double z2);

void Init(double* x, double* y, double* z, size_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {1, 100};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
    {
        x[i] = ui_dist(rng);
        y[i] = ui_dist(rng);
        z[i] = ui_dist(rng);
    }
}

double CalcDistanceCpp(double x1, double y1, double z1, double x2, double y2, double z2)
{
    double tx = (x2 - x1) * (x2 - x1);
    double ty = (y2 - y1) * (y2 - y1);
    double tz = (z2 - z1) * (z2 - z1);
    double dist = sqrt(tx + ty + tz);

    return dist;
}

int main()
{
    const size_t n = 20;
    double x1[n], y1[n], z1[n];
    double x2[n], y2[n], z2[n];
    double dist1[n];
    double dist2[n];

    Init(x1, y1, z1, n, 29);
    Init(x2, y2, z2, n, 37);

    for (size_t i = 0; i < n; i++)
    {
        dist1[i] = CalcDistanceCpp(x1[i], y1[i], z1[i], x2[i], y2[i], z2[i]);
        dist2[i] = CalcDistance_(x1[i], y1[i], z1[i], x2[i], y2[i], z2[i]);
    }

    cout << fixed;

    for (size_t i = 0; i < n; i++)
    {
        cout << "i: " << setw(2) << i << "  ";
        cout << setprecision(0);

        cout << "p1(";
        cout << setw(3) << x1[i] << ",";
        cout << setw(3) << y1[i] << ",";
        cout << setw(3) << z1[i] << ") | ";

        cout << "p2(";
        cout << setw(3) << x2[i] << ",";
        cout << setw(3) << y2[i] << ",";
        cout << setw(3) << z2[i] << ") | ";

        cout << setprecision(4);
        cout << "dist1: " << setw(8) << dist1[i] << " | ";
        cout << "dist2: " << setw(8) << dist2[i] << '\n';
    }

    return 0;
}

;-------------------------------------------------
;               Ch05_03.asm
;-------------------------------------------------

; extern "C" double CalcDistance_(double x1, double y1, double z1, double x2, double y2, double z2)

            .code
CalcDistance_ proc

; Load arguments from stack
            vmovsd xmm4,real8 ptr [rsp+40]  ;xmm4 = y2
            vmovsd xmm5,real8 ptr [rsp+48]  ;xmm5 = z2

; Calculate squares of coordinate distances
            vsubsd xmm0,xmm3,xmm0           ;xmm0 = x2 - x1
            vmulsd xmm0,xmm0,xmm0           ;xmm0 = (x2 - x1) * (x2 - x1)

            vsubsd xmm1,xmm4,xmm1           ;xmm1 = y2 - y1
            vmulsd xmm1,xmm1,xmm1           ;xmm1 = (y2 - y1) * (y2 - y1)

            vsubsd xmm2,xmm5,xmm2           ;xmm2 = z2 - z1
            vmulsd xmm2,xmm2,xmm2           ;xmm2 = (z2 - z1) * (z2 - z1)

; Calculate final distance
            vaddsd xmm3,xmm0,xmm1
            vaddsd xmm4,xmm2,xmm3           ;xmm4 = sum of squares
            vsqrtsd xmm0,xmm0,xmm4          ;xmm0 = final distance value
            ret
CalcDistance_ endp
            end


If you examine the declaration of function CalcDistance_, you will notice that it specifies six
double-precision argument values. The argument values x1, y1, z1, and x2 are passed in registers XMM0, XMM1,
XMM2, and XMM3, respectively. The final two argument values, y2 and z2, are passed on the stack, as
illustrated in Figure 5-1. Note that this figure shows only the low-order quadword of each XMM register; the
high-order quadwords are not used to pass argument values and are undefined.


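[Figure: the stack at entry, drawn from high memory to low memory and ending with the RDX home area, the
RCX home area, and the return address, alongside the argument registers XMM3, XMM2, XMM1, and XMM0.
Shaded entries are undefined.]


Figure 5-1. Stack layout and argument registers at entry to CalcDistance_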
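
The stack offsets used for y2 and z2 follow from the layout in Figure 5-1: the return address occupies the quadword at [rsp], and each argument position, including the four register home slots, is given one quadword above it. The short C++ sketch below expresses that rule; the function name ArgStackOffset is hypothetical, and the calculation is valid only at function entry, before the function pushes anything or adjusts RSP.

```cpp
#include <cstddef>

// Offset from RSP, at function entry, of the stack slot that belongs to the
// argument at 1-based position arg_num under the 64-bit Visual C++ calling
// convention: an 8-byte return address at [rsp], followed by one 8-byte slot
// per argument position (positions 1-4 form the register home area).
constexpr std::size_t ArgStackOffset(std::size_t arg_num)
{
    return 8 + 8 * (arg_num - 1);
}

// The fifth and sixth arguments of CalcDistance_ (y2 and z2).
static_assert(ArgStackOffset(5) == 40, "y2 is loaded from [rsp+40]");
static_assert(ArgStackOffset(6) == 48, "z2 is loaded from [rsp+48]");
```

Positions 1 through 4 map to offsets 8 through 32, which is the home area that the caller reserves even though those arguments travel in registers.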


The function CalcDistance_ begins with a vmovsd xmm4,real8 ptr [rsp+40] instruction that loads
argument value y2 from the stack into register XMM4. This is followed by a vmovsd xmm5,real8 ptr
[rsp+48] instruction that loads argument value z2 into register XMM5. The next two instructions, vsubsd
xmm0,xmm3,xmm0 and vmulsd xmm0,xmm0,xmm0, calculate (x2 - x1) * (x2 - x1). A similar sequence of
instructions is then used to calculate (y2 - y1) * (y2 - y1) and (z2 - z1) * (z2 - z1). This is followed
by two vaddsd instructions that sum the three coordinate squares. A vsqrtsd xmm0,xmm0,xmm4 instruction
computes the final distance. Note that the vsqrtsd instruction computes the square root of its second source
operand. Similar to other scalar double-precision floating-point arithmetic instructions, vsqrtsd also copies
bits 127:64 of the first source operand to the same bit positions of the destination operand. Here is the output
for example Ch05_03:


i:  0  p1( 86, 84,  5) | p2( 32,  8, 77) | dist1: 117.7964 | dist2: 117.7964
i:  1  p1( 38, 63, 77) | p2( 28, 49, 86) | dist1:  19.4165 | dist2:  19.4165
i:  2  p1( 17, 18, 54) | p2( 79, 51, 80) | dist1:  74.8933 | dist2:  74.8933
i:  3  p1( 85, 50, 28) | p2( 40, 87, 90) | dist1:  85.0764 | dist2:  85.0764
i:  4  p1( 98, 47, 79) | p2( 28, 85, 38) | dist1:  89.5824 | dist2:  89.5824
i:  5  p1( 21, 78, 36) | p2( 92, 12, 47) | dist1:  97.5602 | dist2:  97.5602
i:  6  p1( 16, 50, 97) | p2( 61, 13, 40) | dist1:  81.5046 | dist2:  81.5046
i:  7  p1( 31, 96, 49) | p2( 31, 37, 45) | dist1:  59.1354 | dist2:  59.1354
i:  8  p1( 13, 87, 40) | p2( 95, 41, 87) | dist1: 105.1142 | dist2: 105.1142
i:  9  p1( 35, 48,  4) | p2( 26, 13, 43) | dist1:  53.1695 | dist2:  53.1695
i: 10  p1( 43, 56, 85) | p2( 88, 17, 45) | dist1:  71.7356 | dist2:  71.7356
i: 11  p1( 59, 88, 77) | p2( 26, 11, 72) | dist1:  83.9226 | dist2:  83.9226
i: 12  p1( 56, 48, 71) | p2(  3, 56, 81) | dist1:  54.5252 | dist2:  54.5252
i: 13  p1( 97, 19, 11) | p2( 36, 35, 58) | dist1:  78.6511 | dist2:  78.6511
i: 14  p1( 50, 79, 74) | p2( 60,  7, 32) | dist1:  83.9524 | dist2:  83.9524
i: 15  p1( 84, 16, 29) | p2( 91,  4, 91) | dist1:  63.5374 | dist2:  63.5374
i: 16  p1( 67, 77, 65) | p2( 86, 47, 59) | dist1:  36.0139 | dist2:  36.0139
i: 17  p1( 67,  1,  3) | p2( 34, 19, 64) | dist1:  71.6519 | dist2:  71.6519
i: 18  p1( 41, 79, 73) | p2( 17,  2, 68) | dist1:  80.8084 | dist2:  80.8084
i: 19  p1( 86, 40, 66) | p2( 76, 12, 61) | dist1:  30.1496 | dist2:  30.1496


Scalar Floating-Point Compares and Conversions 


Any function that carries out basic floating-point arithmetic is also likely to perform floating-point compare 
operations and conversions between integer and floating-point values. The sample source code of this 
section illustrates how to perform scalar floating-point compares and data conversions. It begins with a 
couple of examples that demonstrate methods for comparing two floating-point values and making a logical 
decision based on the result. This is followed by an example that shows floating-point conversion operations 
using values of different types. 


Floating-Point Compares 


Listing 5-4 shows the source code for example Ch05_04, which demonstrates the use of the floating-point
compare instructions vcomis[d|s]. Similar to the AVX scalar floating-point arithmetic instructions, the
final letter of these mnemonics denotes the operand type (d = double-precision, s = single-precision).
The vcomis[d|s] instructions compare two floating-point operands and set status flags in RFLAGS to
signify a result of less than, equal, greater than, or unordered. An unordered floating-point compare is true
when one or both of the instruction operands is a NaN or erroneously encoded. The assembly language
functions CompareVCOMISD_ and CompareVCOMISS_ illustrate the use of the vcomisd and vcomiss instructions,
respectively. In the discussions that follow, I'll describe the workings of CompareVCOMISS_; any comments
made about this function also apply to CompareVCOMISD_.
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Listing 5-4. Example Ch05 04 


#include "stdafx.h" 
#include «string» 
#include <iostream> 
#include «iomanip» 
#include «limits» 


using namespace std; 


extern "C" void CompareVCOMISS (float a, float b, bool* results); 
extern "C" void CompareVCOMISD (double a, double b, bool* results); 


const char* c OpStrings[] = ("UO", "LT", "LE", "EQ", "NE", "GT", "GE"}; 
const size t c NumOpStrings - sizeof(c OpStrings) / sizeof(char*); 


const string g Dashes(72, '-'); 
template «typename T» void PrintResults(T a, T b, const bool* cmp results) 


cout << ‘a = ««a«« ", 5j 
cout << "b=" << b << '\n'; 


for (size t i = 0; i « c NumOpStrings; i++) 


( 
cout << c OpStrings[i] << '='; 
cout << boolalpha << left << setw(6) << cmp results[i] << ' '; 
} 
cout << "\n\n"; 
} 
void CompareVCOMISS() 
{ 


const size t n = 6; 
float a[n] (120.0, 250.0, 300.0, -18.0, -81.0, 42.0); 
float b[n] (130.0, 240.0, 300.0, 32.0, -100.0, 0.0); 


// Set NAN test value 
b[n - 1] = numeric limits«float»::quiet NaN(); 


cout << "\nResults for CompareVCOMISS\n"; 
cout << g Dashes << '\n'; 


for (size t i = 0; i< n; i++) 


{ 
bool cmp results[c NumOpStrings]; 
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} 


CompareVCOMISS (a[i], b[i], cmp results); 
PrintResults(a[i], b[i], cmp results); 


void CompareVCOMISD(void) 


( 


int 


3 


3 


3 


const size t n= 6; 
double a[n] {120.0, 250.0, 300.0, -18.0, -81.0, 42.0}; 
double b[n] {130.0, 240.0, 300.0, 32.0, -100.0, 0.0}; 


// Set NAN test value 
b[n - 1] = numeric limits«double»::quiet NaN(); 


cout << "\nResults for CompareVCOMISD An"; 
cout << g Dashes << '\n'; 


for (size t i = 0; i< n; i++) 
{ 
bool cmp results[c NumOpStrings]; 


CompareVCOMISD (a[i], b[i], cmp results); 
PrintResults(a[i], b[i], cmp results); 


main() 
CompareVCOMISS(); 


CompareVCOMISD() ; 
return 0; 


; extern "C" void CompareVCOMISS_(float a, float b, bool* results);

        .code
CompareVCOMISS_ proc

; Set result flags based on compare status
        vcomiss xmm0,xmm1
        setp byte ptr [r8]                  ;RFLAGS.PF = 1 if unordered
        jnp @F
        xor al,al
        mov byte ptr [r8+1],al              ;Use default result values
        mov byte ptr [r8+2],al
        mov byte ptr [r8+3],al
        mov byte ptr [r8+4],al
        mov byte ptr [r8+5],al
        mov byte ptr [r8+6],al
        jmp Done

@@:     setb byte ptr [r8+1]                ;set byte if a < b
        setbe byte ptr [r8+2]               ;set byte if a <= b
        sete byte ptr [r8+3]                ;set byte if a == b
        setne byte ptr [r8+4]               ;set byte if a != b
        seta byte ptr [r8+5]                ;set byte if a > b
        setae byte ptr [r8+6]               ;set byte if a >= b

Done:   ret

CompareVCOMISS_ endp


; extern "C" void CompareVCOMISD_(double a, double b, bool* results);

CompareVCOMISD_ proc

; Set result flags based on compare status
        vcomisd xmm0,xmm1
        setp byte ptr [r8]                  ;RFLAGS.PF = 1 if unordered
        jnp @F
        xor al,al
        mov byte ptr [r8+1],al              ;Use default result values
        mov byte ptr [r8+2],al
        mov byte ptr [r8+3],al
        mov byte ptr [r8+4],al
        mov byte ptr [r8+5],al
        mov byte ptr [r8+6],al
        jmp Done

@@:     setb byte ptr [r8+1]                ;set byte if a < b
        setbe byte ptr [r8+2]               ;set byte if a <= b
        sete byte ptr [r8+3]                ;set byte if a == b
        setne byte ptr [r8+4]               ;set byte if a != b
        seta byte ptr [r8+5]                ;set byte if a > b
        setae byte ptr [r8+6]               ;set byte if a >= b

Done:   ret

CompareVCOMISD_ endp
        end



The function CompareVCOMISS_ accepts two argument values of type float and a pointer to an array
of bools for the compare results. The first instruction of CompareVCOMISS_, vcomiss xmm0,xmm1, performs
a single-precision floating-point compare of argument values a and b. Note that these values were passed
to CompareVCOMISS_ in registers XMM0 and XMM1. Execution of vcomiss sets RFLAGS.ZF, RFLAGS.PF,
and RFLAGS.CF, as shown in Table 5-1. The setting of these status flags facilitates the use of the conditional
instructions cmovcc, jcc, and setcc, as shown in Table 5-2.

Table 5-1. Status Flags Set by the vcomis[d|s] Instructions


Condition        RFLAGS.ZF   RFLAGS.PF   RFLAGS.CF
XMM0 > XMM1      0           0           0
XMM0 == XMM1     1           0           0
XMM0 < XMM1      0           0           1
Unordered        1           1           1


Table 5-2. Condition Codes Following Execution of vcomis[d|s] 


Relational Operator   Condition Code          RFLAGS Test Condition
XMM0 < XMM1           Below (b)               CF == 1
XMM0 <= XMM1          Below or equal (be)     CF == 1 || ZF == 1
XMM0 == XMM1          Equal (e or z)          ZF == 1
XMM0 != XMM1          Not Equal (ne or nz)    ZF == 0
XMM0 > XMM1           Above (a)               CF == 0 && ZF == 0
XMM0 >= XMM1          Above or Equal (ae)     CF == 0
Unordered             Parity (p)              PF == 1


It should be noted that the status flags shown in Table 5-1 are set only if the compare completes without
raising an unmasked floating-point exception. When floating-point exceptions are masked (the default state
for Visual C++), the flags are always updated, including the unordered case.
If floating-point invalid operation or denormal exceptions are unmasked (MXCSR.IM = 0 or MXCSR.DM = 0)
and one of the compare operands is a QNaN, SNaN, or denormal, the processor will generate an exception
without updating the status flags in RFLAGS. Chapter 4 contains additional information regarding use of the
MXCSR register, QNaNs, SNaNs, and denormals.

Following execution of the vcomiss xmm0,xmm1 instruction, a series of setcc (Set Byte on Condition)
instructions are used to highlight the relational operators shown in Table 5-2. The setp byte ptr [r8]
instruction sets the destination operand byte to 1 if RFLAGS.PF is set (i.e., one of the operands is a QNaN
or SNaN); otherwise, the destination operand byte is set to 0. If the compare was ordered, the remaining
setcc instructions in CompareVCOMISS_ save all possible compare outcomes by setting each entry in array
results to 0 or 1. As previously mentioned, functions can also use the jcc and cmovcc instructions following
execution of a vcomis[d|s] instruction to perform program jumps or conditional data moves based on the
outcome of a floating-point compare. Here is the output for source code example Ch05_04:


Results for CompareVCOMISS
------------------------------------------------------------------------
a = 120, b = 130
UO=false  LT=true   LE=true   EQ=false  NE=true   GT=false  GE=false

a = 250, b = 240
UO=false  LT=false  LE=false  EQ=false  NE=true   GT=true   GE=true

a = 300, b = 300
UO=false  LT=false  LE=true   EQ=true   NE=false  GT=false  GE=true

a = -18, b = 32
UO=false  LT=true   LE=true   EQ=false  NE=true   GT=false  GE=false

a = -81, b = -100
UO=false  LT=false  LE=false  EQ=false  NE=true   GT=true   GE=true

a = 42, b = nan
UO=true   LT=false  LE=false  EQ=false  NE=false  GT=false  GE=false


Results for CompareVCOMISD
------------------------------------------------------------------------
a = 120, b = 130
UO=false  LT=true   LE=true   EQ=false  NE=true   GT=false  GE=false

a = 250, b = 240
UO=false  LT=false  LE=false  EQ=false  NE=true   GT=true   GE=true

a = 300, b = 300
UO=false  LT=false  LE=true   EQ=true   NE=false  GT=false  GE=true

a = -18, b = 32
UO=false  LT=true   LE=true   EQ=false  NE=true   GT=false  GE=false

a = -81, b = -100
UO=false  LT=false  LE=false  EQ=false  NE=true   GT=true   GE=true

a = 42, b = nan
UO=true   LT=false  LE=false  EQ=false  NE=false  GT=false  GE=false


Listing 5-5 contains the source code for example Ch05_05. This example illustrates the use of the vcmpsd
instruction, which compares two double-precision floating-point values using a compare predicate that's
specified as an immediate operand. The vcmpsd instruction does not use any of the status bits in RFLAGS to
indicate compare results. Instead, it returns a quadword mask of all ones or all zeros to signify a true or false
result. The AVX instruction set also includes vcmpss, which can be used to perform single-precision
floating-point compares. This instruction is equivalent to the vcmpsd instruction except that it returns a
doubleword mask.


Listing 5-5. Example Ch05_05

#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <limits>
#include <string>

using namespace std;

extern "C" void CompareVCMPSD_(double a, double b, bool* results);

const string g_Dashes(40, '-');

int main()
{
    const char* cmp_names[] =
    {
        "cmp_eq", "cmp_neq", "cmp_lt", "cmp_le",
        "cmp_gt", "cmp_ge", "cmp_ord", "cmp_unord"
    };

    const size_t num_cmp_names = sizeof(cmp_names) / sizeof(char*);

    const size_t n = 6;
    double a[n] = {120.0, 250.0, 300.0, -18.0, -81.0, 42.0};
    double b[n] = {130.0, 240.0, 300.0, 32.0, -100.0, 0.0};

    b[n - 1] = numeric_limits<double>::quiet_NaN();

    cout << "Results for CompareVCMPSD\n";
    cout << g_Dashes << '\n';

    for (size_t i = 0; i < n; i++)
    {
        bool cmp_results[num_cmp_names];

        CompareVCMPSD_(a[i], b[i], cmp_results);

        cout << "a = " << a[i] << "  ";
        cout << "b = " << b[i] << '\n';

        for (size_t j = 0; j < num_cmp_names; j++)
        {
            string s1 = cmp_names[j] + string(":");
            string s2 = ((j & 1) != 0) ? "\n" : "  ";

            cout << left << setw(12) << s1;
            cout << boolalpha << setw(6) << cmp_results[j] << s2;
        }

        cout << "\n";
    }

    return 0;
}

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; Basic compare predicates 


CMP_EQ          equ 00h
CMP_LT          equ 01h
CMP_LE          equ 02h
CMP_UNORD       equ 03h
CMP_NEQ         equ 04h
CMP_NLT         equ 05h
CMP_NLE         equ 06h
CMP_ORD         equ 07h

; Extended compare predicates for AVX
CMP_EQ_UQ       equ 08h
CMP_NGE         equ 09h
CMP_NGT         equ 0Ah
CMP_FALSE       equ 0Bh
CMP_NEQ_OQ      equ 0Ch
CMP_GE          equ 0Dh
CMP_GT          equ 0Eh
CMP_TRUE        equ 0Fh
CMP_EQ_OS       equ 10h
CMP_LT_OQ       equ 11h
CMP_LE_OQ       equ 12h
CMP_UNORD_S     equ 13h
CMP_NEQ_US      equ 14h
CMP_NLT_UQ      equ 15h
CMP_NLE_UQ      equ 16h
CMP_ORD_S       equ 17h
CMP_EQ_US       equ 18h
CMP_NGE_UQ      equ 19h
CMP_NGT_UQ      equ 1Ah
CMP_FALSE_OS    equ 1Bh
CMP_NEQ_OS      equ 1Ch
CMP_GE_OQ       equ 1Dh
CMP_GT_OQ       equ 1Eh
CMP_TRUE_US     equ 1Fh

        include <cmpequ.asmh>

; extern "C" void CompareVCMPSD_(double a, double b, bool* results)

        .code
CompareVCMPSD_ proc

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; Perform compare for equality 


        vcmpsd xmm2,xmm0,xmm1,CMP_EQ        ;perform compare operation
        vmovq rax,xmm2                      ;rax = compare result (all 1s or 0s)
        and al,1                            ;mask out unneeded bits
        mov byte ptr [r8],al                ;save result as C++ bool

; Perform compare for inequality
        vcmpsd xmm2,xmm0,xmm1,CMP_NEQ
        vmovq rax,xmm2
        and al,1
        mov byte ptr [r8+1],al

; Perform compare for less than
        vcmpsd xmm2,xmm0,xmm1,CMP_LT
        vmovq rax,xmm2
        and al,1
        mov byte ptr [r8+2],al

; Perform compare for less than or equal
        vcmpsd xmm2,xmm0,xmm1,CMP_LE
        vmovq rax,xmm2
        and al,1
        mov byte ptr [r8+3],al

; Perform compare for greater than
        vcmpsd xmm2,xmm0,xmm1,CMP_GT
        vmovq rax,xmm2
        and al,1
        mov byte ptr [r8+4],al

; Perform compare for greater than or equal
        vcmpsd xmm2,xmm0,xmm1,CMP_GE
        vmovq rax,xmm2
        and al,1
        mov byte ptr [r8+5],al

; Perform compare for ordered
        vcmpsd xmm2,xmm0,xmm1,CMP_ORD
        vmovq rax,xmm2
        and al,1
        mov byte ptr [r8+6],al

; Perform compare for unordered
        vcmpsd xmm2,xmm0,xmm1,CMP_UNORD
        vmovq rax,xmm2
        and al,1
        mov byte ptr [r8+7],al

        ret

CompareVCMPSD_ endp
        end

Similar to the previous example, the C++ code for example Ch05_05 contains some test cases that
exercise the assembly language function CompareVCMPSD_. Following the C++ code in Listing 5-5 is the
assembly language header file cmpequ.asmh. This file contains a collection of equate directives, which are
used to assign symbolic names to numeric values. The equate directives in cmpequ.asmh define symbolic
names for the compare predicates that are used by a number of x86-AVX scalar and packed compare
instructions including vcmpsd. You'll shortly see how this works. There is no standard file extension for an
x86 assembly language header file; I use .asmh but .inc is also frequently used.

Using an assembly language header file is similar to using a C++ header file. In the current example,
the statement include <cmpequ.asmh> incorporates the contents of cmpequ.asmh into the file
Ch05_05.asm during assembly. The angled brackets surrounding the filename can be omitted if the filename
doesn't contain any backslashes or MASM special characters, but it's usually simpler and more consistent to
just always use them. Besides equate statements, assembly language header files are often used for macro
definitions. You'll learn about macros later in this chapter.

The first instruction of function CompareVCMPSD_, vcmpsd xmm2,xmm0,xmm1,CMP_EQ, compares the
contents of registers XMM0 and XMM1 for equality. These registers contain argument values a and b. If a
and b are equal, the low-order quadword of XMM2 is set to all ones; otherwise, the low-order quadword is
set to all zeros. Note that the vcmpsd instruction requires four operands: an immediate operand that specifies
the compare predicate, two source operands (the first source operand must be an XMM register while the
second source operand can be an XMM register or an operand in memory), and a destination operand
that must be an XMM register. The ensuing vmovq rax,xmm2 instruction copies the low-order quadword
of XMM2 (which contains all zeros or all ones) to register RAX. This is followed by an and al,1 instruction
that sets register AL to 1 if the compare predicate is true; otherwise AL is set to 0. The final instruction of
the sequence, mov byte ptr [r8],al, saves the compare outcome to the array results. The function
CompareVCMPSD_ then uses similar instruction sequences to demonstrate other frequently-used compare
predicates. Here are the results for example Ch05_05:


Results for CompareVCMPSD
----------------------------------------
a = 120  b = 130
cmp_eq:     false   cmp_neq:    true
cmp_lt:     true    cmp_le:     true
cmp_gt:     false   cmp_ge:     false
cmp_ord:    true    cmp_unord:  false

a = 250  b = 240
cmp_eq:     false   cmp_neq:    true
cmp_lt:     false   cmp_le:     false
cmp_gt:     true    cmp_ge:     true
cmp_ord:    true    cmp_unord:  false

a = 300  b = 300
cmp_eq:     true    cmp_neq:    false
cmp_lt:     false   cmp_le:     true
cmp_gt:     false   cmp_ge:     true
cmp_ord:    true    cmp_unord:  false

a = -18  b = 32
cmp_eq:     false   cmp_neq:    true
cmp_lt:     true    cmp_le:     true
cmp_gt:     false   cmp_ge:     false
cmp_ord:    true    cmp_unord:  false

a = -81  b = -100
cmp_eq:     false   cmp_neq:    true
cmp_lt:     false   cmp_le:     false
cmp_gt:     true    cmp_ge:     true
cmp_ord:    true    cmp_unord:  false

a = 42  b = nan
cmp_eq:     false   cmp_neq:    true
cmp_lt:     false   cmp_le:     false
cmp_gt:     false   cmp_ge:     false
cmp_ord:    false   cmp_unord:  true


Many x86 assemblers including MASM support pseudo-op forms of the vcmpsd instruction and its
single-precision counterpart vcmpss. Pseudo-ops are simulated instruction mnemonics with the compare
predicate embedded within the mnemonic text. In function CompareVCMPSD_, for example, the pseudo-op
vcmpeqsd xmm2,xmm0,xmm1 could have been used instead of the instruction vcmpsd xmm2,xmm0,xmm1,CMP_EQ.
Personally, I find the standard reference manual mnemonics easier to read since the compare predicate
is explicitly specified as an operand instead of being buried within the pseudo-op, especially when using one
of the more esoteric compare predicates.

In this section, you learned how to perform compare operations using the vcomis[d|s] and vcmps[d|s]
instructions. You might be wondering at this point which compare instructions should be used. For basic
scalar floating-point compare operations (e.g., equal, not equal, less than, less than or equal, greater than,
and greater than or equal), the vcomis[d|s] instructions are slightly simpler to use since they directly set
the status flags in RFLAGS. The vcmps[d|s] instructions must be used to take advantage of the extended
compare predicates that AVX supports. Another reason for using the vcmps[d|s] instructions is the
similarity between these instructions and the corresponding vcmpp[d|s] instructions for packed
floating-point operands. You'll learn how to use the packed floating-point compare instructions in
Chapter 6.


Floating-Point Conversions 


A common operation in many C++ programs is to cast a single-precision or double-precision
floating-point value to an integer or vice versa. Other frequent operations include the promotion of a
floating-point value from single-precision to double-precision and the narrowing of a double-precision value
to single-precision. AVX includes a number of instructions that perform these types of conversions. Listing 5-6
shows the code for a sample program that demonstrates how to use some of the AVX conversion instructions.
It also illustrates how to modify the rounding control field of the MXCSR register in order to change the AVX
floating-point rounding mode.


Listing 5-6. Example Ch05_06

#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <cstdint>
#include <string>
#define _USE_MATH_DEFINES
#include <math.h>

using namespace std;

// Simple union for data exchange
union Uval
{
    int32_t m_I32;
    int64_t m_I64;
    float m_F32;
    double m_F64;
};

// The order of values below must match the jump table
// that's defined in the .asm file.
enum CvtOp : unsigned int
{
    I32_F32,       // int32_t to float
    F32_I32,       // float to int32_t
    I32_F64,       // int32_t to double
    F64_I32,       // double to int32_t
    I64_F32,       // int64_t to float
    F32_I64,       // float to int64_t
    I64_F64,       // int64_t to double
    F64_I64,       // double to int64_t
    F32_F64,       // float to double
    F64_F32,       // double to float
};

// Enumerated type for rounding mode
enum RoundingMode : unsigned int
{
    Nearest, Down, Up, Truncate
};

const string c_RoundingModeStrings[] = {"Nearest", "Down", "Up", "Truncate"};

const RoundingMode c_RoundingModeVals[] = {RoundingMode::Nearest, RoundingMode::Down,
    RoundingMode::Up, RoundingMode::Truncate};

const size_t c_NumRoundingModes = sizeof(c_RoundingModeVals) / sizeof(RoundingMode);

extern "C" RoundingMode GetMxcsrRoundingMode_(void);
extern "C" void SetMxcsrRoundingMode_(RoundingMode rm);
extern "C" bool ConvertScalar_(Uval* a, Uval* b, CvtOp cvt_op);

int main()
{
    Uval src1, src2, src3, src4, src5;

    src1.m_F32 = (float)M_PI;
    src2.m_F32 = (float)-M_E;
    src3.m_F64 = M_SQRT2;
    src4.m_F64 = M_SQRT1_2;
    src5.m_F64 = 1.0 + DBL_EPSILON;

    for (size_t i = 0; i < c_NumRoundingModes; i++)
    {
        Uval des1, des2, des3, des4, des5;
        RoundingMode rm_save = GetMxcsrRoundingMode_();
        RoundingMode rm_test = c_RoundingModeVals[i];

        SetMxcsrRoundingMode_(rm_test);

        ConvertScalar_(&des1, &src1, CvtOp::F32_I32);
        ConvertScalar_(&des2, &src2, CvtOp::F32_I64);
        ConvertScalar_(&des3, &src3, CvtOp::F64_I32);
        ConvertScalar_(&des4, &src4, CvtOp::F64_I64);
        ConvertScalar_(&des5, &src5, CvtOp::F64_F32);

        SetMxcsrRoundingMode_(rm_save);

        cout << fixed;
        cout << "\nRounding mode = " << c_RoundingModeStrings[rm_test] << '\n';

        cout << "  F32_I32: " << setprecision(8);
        cout << src1.m_F32 << " --> " << des1.m_I32 << '\n';

        cout << "  F32_I64: " << setprecision(8);
        cout << src2.m_F32 << " --> " << des2.m_I64 << '\n';

        cout << "  F64_I32: " << setprecision(8);
        cout << src3.m_F64 << " --> " << des3.m_I32 << '\n';

        cout << "  F64_I64: " << setprecision(8);
        cout << src4.m_F64 << " --> " << des4.m_I64 << '\n';

        cout << "  F64_F32: ";
        cout << setprecision(16) << src5.m_F64 << " --> ";
        cout << setprecision(8) << des5.m_F32 << '\n';
    }

    return 0;
}

MxcsrRcMask     equ 9fffh                   ;bit pattern for MXCSR.RC
MxcsrRcShift    equ 13                      ;shift count for MXCSR.RC

; extern "C" RoundingMode GetMxcsrRoundingMode_(void);
;


; Description:  The following function obtains the current
;               floating-point rounding mode from MXCSR.RC.
;
; Returns:      Current MXCSR.RC rounding mode.

        .code
GetMxcsrRoundingMode_ proc
        vstmxcsr dword ptr [rsp+8]          ;save mxcsr register
        mov eax,[rsp+8]
        shr eax,MxcsrRcShift                ;eax[1:0] = MXCSR.RC bits
        and eax,3                           ;masked out unwanted bits
        ret
GetMxcsrRoundingMode_ endp

; extern "C" void SetMxcsrRoundingMode_(RoundingMode rm);
;
; Description:  The following function updates the rounding mode
;               value in MXCSR.RC.

SetMxcsrRoundingMode_ proc
        and ecx,3                           ;masked out unwanted bits
        shl ecx,MxcsrRcShift                ;ecx[14:13] = rm

        vstmxcsr dword ptr [rsp+8]          ;save current MXCSR
        mov eax,[rsp+8]
        and eax,MxcsrRcMask                 ;masked out old MXCSR.RC bits
        or eax,ecx                          ;insert new MXCSR.RC bits
        mov [rsp+8],eax
        vldmxcsr dword ptr [rsp+8]          ;load updated MXCSR
        ret
SetMxcsrRoundingMode_ endp

; extern "C" bool ConvertScalar_(Uval* des, const Uval* src, CvtOp cvt_op)
;
; Note:         This function requires linker option /LARGEADDRESSAWARE:NO
;               to be explicitly set.

ConvertScalar_ proc



; Make sure cvt_op is valid, then jump to target conversion code
        mov eax,r8d                         ;eax = CvtOp
        cmp eax,CvtOpTableCount
        jae BadCvtOp                        ;jump if cvt_op is invalid
        jmp [CvtOpTable+rax*8]              ;jump to specified conversion

; Conversions between int32_t and float/double

I32_F32:
        mov eax,[rdx]                       ;load integer value
        vcvtsi2ss xmm0,xmm0,eax             ;convert to float
        vmovss real4 ptr [rcx],xmm0         ;save result
        mov eax,1
        ret

F32_I32:
        vmovss xmm0,real4 ptr [rdx]         ;load float value
        vcvtss2si eax,xmm0                  ;convert to integer
        mov [rcx],eax                       ;save result
        mov eax,1
        ret

I32_F64:
        mov eax,[rdx]                       ;load integer value
        vcvtsi2sd xmm0,xmm0,eax             ;convert to double
        vmovsd real8 ptr [rcx],xmm0         ;save result
        mov eax,1
        ret

F64_I32:
        vmovsd xmm0,real8 ptr [rdx]         ;load double value
        vcvtsd2si eax,xmm0                  ;convert to integer
        mov [rcx],eax                       ;save result
        mov eax,1
        ret

; Conversions between int64_t and float/double

I64_F32:
        mov rax,[rdx]                       ;load integer value
        vcvtsi2ss xmm0,xmm0,rax             ;convert to float
        vmovss real4 ptr [rcx],xmm0         ;save result
        mov eax,1
        ret

F32_I64:
        vmovss xmm0,real4 ptr [rdx]         ;load float value
        vcvtss2si rax,xmm0                  ;convert to integer
        mov [rcx],rax                       ;save result
        mov eax,1
        ret

I64_F64:
        mov rax,[rdx]                       ;load integer value
        vcvtsi2sd xmm0,xmm0,rax             ;convert to double
        vmovsd real8 ptr [rcx],xmm0         ;save result
        mov eax,1
        ret

F64_I64:
        vmovsd xmm0,real8 ptr [rdx]         ;load double value
        vcvtsd2si rax,xmm0                  ;convert to integer
        mov [rcx],rax                       ;save result
        mov eax,1
        ret

; Conversions between float and double

F32_F64:
        vmovss xmm0,real4 ptr [rdx]         ;load float value
        vcvtss2sd xmm1,xmm1,xmm0            ;convert to double
        vmovsd real8 ptr [rcx],xmm1         ;save result
        mov eax,1
        ret

F64_F32:
        vmovsd xmm0,real8 ptr [rdx]         ;load double value
        vcvtsd2ss xmm1,xmm1,xmm0            ;convert to float
        vmovss real4 ptr [rcx],xmm1         ;save result
        mov eax,1
        ret

BadCvtOp:
        xor eax,eax                         ;set error return code
        ret

; The order of values in following table must match the enum CvtOp
; that's defined in the .cpp file.

        align 8
CvtOpTable equ $
        qword I32_F32, F32_I32
        qword I32_F64, F64_I32
        qword I64_F32, F32_I64
        qword I64_F64, F64_I64
        qword F32_F64, F64_F32
CvtOpTableCount equ ($ - CvtOpTable) / size qword

ConvertScalar_ endp
        end


Near the top of the C++ code is a declaration for union Uval, which is used for data exchange purposes.
This is followed by two enumerations: one to select a floating-point conversion type (CvtOp) and another to
specify a floating-point rounding mode (RoundingMode). The C++ function main initializes a couple of Uval
instances as test cases and invokes the assembly language function ConvertScalar_ to perform various
conversions using different rounding modes. The result of each conversion operation is then displayed for
verification and comparison purposes.

The AVX floating-point rounding mode is determined by the rounding control field (bits 14 and 13)
of the MXCSR register, as discussed in Chapter 4. The default rounding mode for Visual C++ programs is
round to nearest. According to the Visual C++ calling convention, the values in MXCSR[15:6] (i.e., MXCSR
register bits 15 through 6) must be preserved across most function boundaries. The code in main fulfills this
requirement by calling the function GetMxcsrRoundingMode_ to save the current rounding mode prior to
performing any conversion operations using ConvertScalar_. The original rounding mode is ultimately
restored using the function SetMxcsrRoundingMode_. Note that the original rounding mode is restored
prior to the cout statements in main. Also note that I've simplified the rounding mode save and restore
code somewhat by not preserving the rounding mode prior to each use of ConvertScalar_ and restoring it
immediately afterward.

Listing 5-6 also shows the rounding mode control functions. The function GetMxcsrRoundingMode_ uses
a vstmxcsr dword ptr [rsp+8] instruction (Store MXCSR Register State) to save the contents of MXCSR to
the RCX home area on the stack. Recall that a function can use its home area on the stack for any transient
storage purpose. The sole operand of the vstmxcsr instruction must be a doubleword in memory; it cannot
be a general-purpose register. The ensuing mov eax,[rsp+8] instruction copies the current MXCSR value
into register EAX. This is followed by a shift and bitwise AND operation that extracts the rounding control bits.
The corresponding SetMxcsrRoundingMode_ function uses the vldmxcsr instruction (Load MXCSR Register)
to set a rounding mode. The vldmxcsr instruction also requires its sole operand to be a doubleword in
memory. Note that the function SetMxcsrRoundingMode_ also uses the vstmxcsr instruction and some
masking operations to ensure that only the MXCSR's rounding control bits are modified when setting a new
rounding mode.

The function ConvertScalar_ performs floating-point conversions using the specified
numerical arguments and conversion operator. Following validation of the argument cvt_op, a jmp
[CvtOpTable+rax*8] instruction transfers control to the appropriate section in the code that performs the
actual conversion. Note that this instruction exploits a jump table. Here, register RAX (which contains
cvt_op) specifies an index into the table CvtOpTable. The table CvtOpTable is defined immediately after the
ret instruction and contains offsets to the various conversion code blocks. You'll learn more about jump tables
in Chapter 6.

It is also important to note that the same instruction mnemonic is sometimes used when converting
an integer to floating-point and vice versa. For example, the instruction vcvtsi2ss xmm0,xmm0,eax
(located near the label I32_F32) converts a 32-bit signed integer to single-precision floating-point, and the
instruction vcvtsi2ss xmm0,xmm0,rax (located near the label I64_F32) converts a 64-bit signed integer to
single-precision floating-point.

Conversions between two different numerical data types are not always possible. For example, the
vcvtss2si instruction cannot convert large floating-point values to signed 32-bit integers. If a particular
conversion is impossible and invalid operation exceptions (MXCSR.IM) are masked (the default for Visual
C++), the processor sets MXCSR.IE (Invalid Operation Error Flag) and the value 0x80000000 is copied to the
destination operand (0x8000000000000000 when the destination is a 64-bit register). The output for
example Ch05_06 is the following:


Rounding mode = Nearest
  F32_I32: 3.14159274 --> 3
  F32_I64: -2.71828175 --> -3
  F64_I32: 1.41421356 --> 1
  F64_I64: 0.70710678 --> 1
  F64_F32: 1.0000000000000002 --> 1.00000000

Rounding mode = Down
  F32_I32: 3.14159274 --> 3
  F32_I64: -2.71828175 --> -3
  F64_I32: 1.41421356 --> 1
  F64_I64: 0.70710678 --> 0
  F64_F32: 1.0000000000000002 --> 1.00000000

Rounding mode = Up
  F32_I32: 3.14159274 --> 4
  F32_I64: -2.71828175 --> -2
  F64_I32: 1.41421356 --> 2
  F64_I64: 0.70710678 --> 1
  F64_F32: 1.0000000000000002 --> 1.00000012

Rounding mode = Truncate
  F32_I32: 3.14159274 --> 3
  F32_I64: -2.71828175 --> -2
  F64_I32: 1.41421356 --> 1
  F64_I64: 0.70710678 --> 0
  F64_F32: 1.0000000000000002 --> 1.00000000


Scalar Floating-Point Arrays and Matrices 


In Chapter 3 you learned how to access individual elements and carry out calculations using integer arrays
and matrices. In this section, you'll learn how to perform similar operations using floating-point arrays and
matrices. As you'll soon see, the same assembly language coding techniques are often used for both integer
and floating-point arrays and matrices.


Floating-Point Arrays 


Listing 5-7 shows the code for example Ch05_07. This example illustrates how to calculate the sample mean
and sample standard deviation of an array of double-precision floating-point values.


Listing 5-7. Example Ch05_07

#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <cmath>

using namespace std;

extern "C" bool CalcMeanStdev_(double* mean, double* stdev, const double* x, int n);

bool CalcMeanStdevCpp(double* mean, double* stdev, const double* x, int n)
{
    if (n < 2)
        return false;

    double sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += x[i];

    *mean = sum / n;

    double sum2 = 0.0;

    for (int i = 0; i < n; i++)
    {
        double temp = x[i] - *mean;
        sum2 += temp * temp;
    }

    *stdev = sqrt(sum2 / (n - 1));
    return true;
}

int main()
{
    double x[] = { 10, 2, 33, 19, 41, 24, 75, 37, 18, 97, 14, 71, 88, 92, 7};
    const int n = sizeof(x) / sizeof(double);

    double mean1 = 0.0, stdev1 = 0.0;
    double mean2 = 0.0, stdev2 = 0.0;

    bool rc1 = CalcMeanStdevCpp(&mean1, &stdev1, x, n);
    bool rc2 = CalcMeanStdev_(&mean2, &stdev2, x, n);

    cout << fixed << setprecision(2);

    for (int i = 0; i < n; i++)
    {
        cout << "x[" << setw(2) << i << "] = ";
        cout << setw(6) << x[i] << '\n';
    }

    cout << setprecision(6);
    cout << '\n';
    cout << "rc1 = " << boolalpha << rc1;
    cout << "  mean1 = " << mean1 << "  stdev1 = " << stdev1 << '\n';
    cout << "rc2 = " << boolalpha << rc2;
    cout << "  mean2 = " << mean2 << "  stdev2 = " << stdev2 << '\n';
}

;-------------------------------------------------
;               Ch05_07.asm
;-------------------------------------------------

; extern "C" bool CalcMeanStdev_(double* mean, double* stdev, const double* a, int n);
;
; Returns:      0 = invalid n, 1 = valid n

        .code
CalcMeanStdev_ proc

; Make sure 'n' is valid
        xor eax,eax                         ;set error return code (also i = 0)
        cmp r9d,2
        jl InvalidArg                       ;jump if n < 2

; Compute sample mean
        vxorpd xmm0,xmm0,xmm0               ;sum = 0.0

@@:     vaddsd xmm0,xmm0,real8 ptr [r8+rax*8]   ;sum += x[i]
        inc eax                             ;i += 1
        cmp eax,r9d
        jl @B                               ;jump if i < n

        vcvtsi2sd xmm1,xmm1,r9d             ;convert n to DPFP
        vdivsd xmm3,xmm0,xmm1               ;xmm3 = mean (sum / n)
        vmovsd real8 ptr [rcx],xmm3         ;save mean

; Compute sample stdev
        xor eax,eax                         ;i = 0
        vxorpd xmm0,xmm0,xmm0               ;sum2 = 0.0

@@:     vmovsd xmm1,real8 ptr [r8+rax*8]    ;xmm1 = x[i]
        vsubsd xmm2,xmm1,xmm3               ;xmm2 = x[i] - mean
        vmulsd xmm2,xmm2,xmm2               ;xmm2 = (x[i] - mean) ** 2
        vaddsd xmm0,xmm0,xmm2               ;sum2 += (x[i] - mean) ** 2
        inc eax                             ;i += 1
        cmp eax,r9d
        jl @B                               ;jump if i < n

        dec r9d                             ;r9d = n - 1
        vcvtsi2sd xmm1,xmm1,r9d             ;convert n - 1 to DPFP
        vdivsd xmm0,xmm0,xmm1               ;xmm0 = sum2 / (n - 1)
        vsqrtsd xmm0,xmm0,xmm0              ;xmm0 = stdev
        vmovsd real8 ptr [rdx],xmm0         ;save stdev

        mov eax,1                           ;set success return code

InvalidArg:
        ret

CalcMeanStdev_ endp
        end

Here are the formulas that example Ch05_07 uses to calculate the sample mean and sample standard
deviation:

\[ \bar{x} = \frac{1}{n} \sum_{i=0}^{n-1} x_i \]

\[ s = \sqrt{\frac{1}{n-1} \sum_{i=0}^{n-1} (x_i - \bar{x})^2} \]


The C++ code for example Ch05_07 is straightforward. It includes a function named CalcMeanStdevCpp
that calculates the sample mean and sample standard deviation of an array of double-precision
floating-point values. Note that this function and its assembly language equivalent return the calculated mean
and standard deviation using pointers. The remaining C++ code initializes a test array and exercises both
calculating functions.

Upon entry to the assembly language function CalcMeanStdev_, the number of array elements n is
checked for validity. Note that the number of array elements must be greater than one in order to calculate a
sample standard deviation. Following validation of n, the vxorpd xmm0,xmm0,xmm0 instruction (Bitwise XOR
of Packed Double-Precision Floating-Point Values) initializes sum to 0.0. This instruction performs a bitwise
XOR operation using all 128 bits of both source operands. A vxorpd instruction is used here to initialize sum
to 0.0 since AVX does not include an explicit XOR instruction for scalar floating-point operands.

The code block that calculates the sample mean requires only seven instructions. The first instruction
of the summing loop, vaddsd xmm0,xmm0,real8 ptr [r8+rax*8], adds x[i] to sum. The inc eax instruction
that follows updates i and the summing loop repeats until i reaches n. Following the summing loop, the
instruction vcvtsi2sd xmm1,xmm1,r9d promotes a copy of n to double-precision floating-point, and the
ensuing vdivsd xmm3,xmm0,xmm1 instruction calculates the final sample mean. The mean is then saved to
the memory location pointed to by RCX.

Calculation of the sample standard deviation begins with two instructions, xor eax,eax and vxorpd
xmm0,xmm0,xmm0, that initialize i to 0 and sum2 to 0.0. The ensuing vsubsd, vmulsd, and vaddsd instructions
calculate sum2 += (x[i] - mean) ** 2 and the summing loop repeats until all array elements have been
processed. Execution of the dec r9d instruction yields the value n - 1. This value is then promoted to
double-precision floating-point by the vcvtsi2sd xmm1,xmm1,r9d instruction. The final two arithmetic
instructions, vdivsd xmm0,xmm0,xmm1 and vsqrtsd xmm0,xmm0,xmm0, compute the sample standard
deviation, and this value is saved to the memory location pointed to by RDX. Here's the output for example
Ch05_07:


x[ 0] = 10.00 
x[ 1] = 2.00 
x[ 2] = 33.00 
x[ 3] = 19.00 
x[ 4] = 41.00 
x[ 5] = 24.00 
x[ 6] = 75.00 
x[ 7] = 37.00 
x[ 8] = 18.00 
x[ 9] = 97.00 
x[10] = 14.00 
x[11] = 71.00 
x[12] = 88.00 
x[13] = 92.00 
x[14] = 7.00 
rc1 = true mean1 = 41.866667 stdev1 = 33.530086 
rc2 = true mean2 = 41.866667 stdev2 = 33.530086
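
The two summing loops described above can be sketched in C++ as follows. This is only an illustrative reference, not the book's C++ function: the name CalcMeanStdevRef and its parameter order are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical reference routine (name and signature are assumptions).
// Returns false when n < 2, since a sample standard deviation needs
// at least two values.
bool CalcMeanStdevRef(double* mean, double* stdev, const double* x, int n)
{
    assert(mean != nullptr && stdev != nullptr && x != nullptr);

    if (n < 2)
        return false;

    // Pass 1: sum = x[0] + ... + x[n-1], then mean = sum / n
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i];
    *mean = sum / n;

    // Pass 2: sum2 = sum of squared deviations from the mean,
    // then stdev = sqrt(sum2 / (n - 1))
    double sum2 = 0.0;
    for (int i = 0; i < n; i++)
    {
        double d = x[i] - *mean;
        sum2 += d * d;
    }
    *stdev = std::sqrt(sum2 / (n - 1));
    return true;
}
```

Calling this sketch with the 15 test values listed in the output above should reproduce the displayed mean and standard deviation to the printed precision.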


Floating-Point Matrices 


Chapter 3 presented an example program (see Ch03_03) that carried out calculations using the elements of
an integer matrix. In this section, you'll learn how to perform similar calculations using the elements of a
single-precision floating-point matrix. Listing 5-8 shows the source code for example Ch05_08.

Listing 5-8. Example Ch05_08


#include "stdafx.h"
#include <iostream>
#include <iomanip>

using namespace std;

extern "C" void CalcMatrixSquaresF32_(float* y, const float* x, float offset, int nrows, int ncols);

void CalcMatrixSquaresF32Cpp(float* y, const float* x, float offset, int nrows, int ncols)
{
    for (int i = 0; i < nrows; i++)
    {
        for (int j = 0; j < ncols; j++)
        {
            int kx = j * ncols + i;
            int ky = i * ncols + j;
            y[ky] = x[kx] * x[kx] + offset;
        }
    }
}

int main()
{
    const int nrows = 6;
    const int ncols = 3;
    const float offset = 0.5;
    float y2[nrows][ncols];
    float y1[nrows][ncols];
    float x[nrows][ncols] { { 1, 2, 3 }, { 4, 5, 6 }, { 7, 8, 9 },
                            { 10, 11, 12 }, { 13, 14, 15 }, { 16, 17, 18 } };

    CalcMatrixSquaresF32Cpp(&y1[0][0], &x[0][0], offset, nrows, ncols);
    CalcMatrixSquaresF32_(&y2[0][0], &x[0][0], offset, nrows, ncols);

    cout << fixed << setprecision(2);
    cout << "offset = " << setw(2) << offset << '\n';

    for (int i = 0; i < nrows; i++)
    {
        for (int j = 0; j < ncols; j++)
        {
            cout << "y1[" << setw(2) << i << "][" << setw(2) << j << "] = ";
            cout << setw(6) << y1[i][j] << "   ";

            cout << "y2[" << setw(2) << i << "][" << setw(2) << j << "] = ";
            cout << setw(6) << y2[i][j] << "   ";

            cout << "x[" << setw(2) << j << "][" << setw(2) << i << "] = ";
            cout << setw(6) << x[j][i] << '\n';

            if (y1[i][j] != y2[i][j])
                cout << "Compare failed\n";
        }
    }

    return 0;
}


; void CalcMatrixSquaresF32_(float* y, const float* x, float offset, int nrows, int ncols);
;
; Calculates:   y[i][j] = x[j][i] * x[j][i] + offset

        .code
CalcMatrixSquaresF32_ proc frame

; Function prolog
        push rsi                                ;save caller's rsi
        .pushreg rsi
        push rdi                                ;save caller's rdi
        .pushreg rdi
        .endprolog

; Make sure nrows and ncols are valid
        movsxd r9,r9d                           ;r9 = nrows
        test r9,r9
        jle InvalidCount                        ;jump if nrows <= 0

        movsxd r10,dword ptr [rsp+56]           ;r10 = ncols
        test r10,r10
        jle InvalidCount                        ;jump if ncols <= 0

; Initialize pointers to source and destination arrays
        mov rsi,rdx                             ;rsi = x
        mov rdi,rcx                             ;rdi = y
        xor rcx,rcx                             ;rcx = i

; Perform the required calculations
Loop1:  xor rdx,rdx                             ;rdx = j

Loop2:  mov rax,rdx                             ;rax = j
        imul rax,r10                            ;rax = j * ncols
        add rax,rcx                             ;rax = j * ncols + i
        vmovss xmm0,real4 ptr [rsi+rax*4]       ;xmm0 = x[j][i]
        vmulss xmm1,xmm0,xmm0                   ;xmm1 = x[j][i] * x[j][i]
        vaddss xmm3,xmm1,xmm2                   ;xmm3 = x[j][i] * x[j][i] + offset

        mov rax,rcx                             ;rax = i
        imul rax,r10                            ;rax = i * ncols
        add rax,rdx                             ;rax = i * ncols + j
        vmovss real4 ptr [rdi+rax*4],xmm3       ;y[i][j] = x[j][i] * x[j][i] + offset

        inc rdx                                 ;j += 1
        cmp rdx,r10
        jl Loop2                                ;jump if j < ncols

        inc rcx                                 ;i += 1
        cmp rcx,r9
        jl Loop1                                ;jump if i < nrows

InvalidCount:

; Function epilog
        pop rdi                                 ;restore caller's rdi
        pop rsi                                 ;restore caller's rsi
        ret

CalcMatrixSquaresF32_ endp
        end


The C++ source code that’s shown in Listing 5-8 is similar to what you saw in Chapter 3. The techniques 
used to calculate the matrix element offsets are identical. The biggest modification made to the C++ code 
was changing the appropriate matrix type declarations from int to float. Another difference between this 
example and the one you saw in Chapter 3 is the addition of the argument offset to the declarations of 
CalcMatrixSquaresF32Cpp and CalcMatrixSquaresF32_. Both of these functions now calculate y[i][j]
= x[j][i] * x[j][i] + offset.

Figure 5-2 shows the stack layout and argument registers immediately following execution of the
push rdi instruction in function CalcMatrixSquaresF32_. This figure illustrates argument passing to a
function that uses a mixture of integer (or pointer) and floating-point arguments. Per the Visual C++ calling
convention, the first four arguments are passed using either a general-purpose or XMM register depending
on argument type and position. More specifically, the first argument value is passed using either register
RCX or XMM0. The second, third, and fourth arguments are passed using RDX/XMM1, R8/XMM2, or
R9/XMM3. Any remaining arguments are passed on the stack.


Figure 5-2. Stack layout and argument registers after execution of push rdi in CalcMatrixSquaresF32_
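
The positional rule described above can be expressed as a small lookup. The following C++ sketch is illustrative only: the helper name ArgLocation and its parameters are hypothetical and are not part of the example source code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: names the location that receives the argument at the
// given zero-based position under the Visual C++ 64-bit calling convention.
// isFloat is true for float/double arguments and false for integer or
// pointer arguments.
std::string ArgLocation(int position, bool isFloat)
{
    static const char* gpr[4] = { "RCX", "RDX", "R8", "R9" };
    static const char* xmm[4] = { "XMM0", "XMM1", "XMM2", "XMM3" };

    assert(position >= 0);

    if (position >= 4)
        return "stack";             // fifth and later arguments

    // The register is selected by position; the argument type only decides
    // whether the general-purpose or the XMM register of that slot is used.
    return isFloat ? xmm[position] : gpr[position];
}
```

Applied to CalcMatrixSquaresF32_, this gives y in RCX, x in RDX, offset in XMM2, nrows in R9, and ncols on the stack, which agrees with the registers used in the assembly language code.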


The assembly language code for function CalcMatrixSquaresF32_ is similar to what you studied
in Chapter 3. Like the C++ code, the methods used to calculate matrix element offsets are the same. The
original matrix element calculating code used integer arithmetic and these instructions have been replaced
with analogous AVX scalar single-precision floating-point instructions. Following calculation of the correct
matrix element offset, the instruction vmovss xmm0,real4 ptr [rsi+rax*4] loads register XMM0 with
matrix element x[j][i]. The ensuing vmulss xmm1,xmm0,xmm0 and vaddss xmm3,xmm1,xmm2 instructions
calculate the required result, and a vmovss real4 ptr [rdi+rax*4],xmm3 instruction saves the result to
y[i][j]. Here is the output for example Ch05_08.


offset = 0.50
y1[ 0][ 0] =   1.50   y2[ 0][ 0] =   1.50   x[ 0][ 0] =   1.00
y1[ 0][ 1] =  16.50   y2[ 0][ 1] =  16.50   x[ 1][ 0] =   4.00
y1[ 0][ 2] =  49.50   y2[ 0][ 2] =  49.50   x[ 2][ 0] =   7.00
y1[ 1][ 0] =   4.50   y2[ 1][ 0] =   4.50   x[ 0][ 1] =   2.00
y1[ 1][ 1] =  25.50   y2[ 1][ 1] =  25.50   x[ 1][ 1] =   5.00
y1[ 1][ 2] =  64.50   y2[ 1][ 2] =  64.50   x[ 2][ 1] =   8.00
y1[ 2][ 0] =   9.50   y2[ 2][ 0] =   9.50   x[ 0][ 2] =   3.00
y1[ 2][ 1] =  36.50   y2[ 2][ 1] =  36.50   x[ 1][ 2] =   6.00
y1[ 2][ 2] =  81.50   y2[ 2][ 2] =  81.50   x[ 2][ 2] =   9.00
y1[ 3][ 0] =  16.50   y2[ 3][ 0] =  16.50   x[ 0][ 3] =   4.00
y1[ 3][ 1] =  49.50   y2[ 3][ 1] =  49.50   x[ 1][ 3] =   7.00
y1[ 3][ 2] = 100.50   y2[ 3][ 2] = 100.50   x[ 2][ 3] =  10.00
y1[ 4][ 0] =  25.50   y2[ 4][ 0] =  25.50   x[ 0][ 4] =   5.00
y1[ 4][ 1] =  64.50   y2[ 4][ 1] =  64.50   x[ 1][ 4] =   8.00
y1[ 4][ 2] = 121.50   y2[ 4][ 2] = 121.50   x[ 2][ 4] =  11.00
y1[ 5][ 0] =  36.50   y2[ 5][ 0] =  36.50   x[ 0][ 5] =   6.00
y1[ 5][ 1] =  81.50   y2[ 5][ 1] =  81.50   x[ 1][ 5] =   9.00
y1[ 5][ 2] = 144.50   y2[ 5][ 2] = 144.50   x[ 2][ 5] =  12.00

Based on the source code examples in this section, it should be readily apparent that when working
with arrays or matrices, techniques independent of the actual data type can be employed to reference
specific elements. For-loop constructs can also be coded using methods that are detached from the actual
data type.
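
One way to see this data-type independence is to write the C++ routine as a template. The sketch below is an illustration only; the generic function CalcMatrixSquares is an assumption made here and does not appear in the example source code. The index calculations for kx and ky are the same for every element type T.

```cpp
#include <cassert>

// Hypothetical generic version of the C++ reference routine. Only the
// arithmetic performed on the elements depends on T; the loop structure
// and the flat-index calculations do not.
template <typename T>
void CalcMatrixSquares(T* y, const T* x, T offset, int nrows, int ncols)
{
    assert(nrows > 0 && ncols > 0);

    for (int i = 0; i < nrows; i++)
    {
        for (int j = 0; j < ncols; j++)
        {
            int kx = j * ncols + i;     // flat index of the source element
            int ky = i * ncols + j;     // flat index of y[i][j]
            y[ky] = x[kx] * x[kx] + offset;
        }
    }
}
```

Instantiating the template with float follows the same index pattern as the function CalcMatrixSquaresF32Cpp, and instantiating it with int corresponds to the integer matrix calculation.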


Calling Convention 


The sample source code presented thus far in this book has informally discussed various aspects of the 
Visual C++ calling convention. In this section, the calling convention is formally explained. It reiterates 
some earlier elucidations and also introduces new requirements and features that haven't been discussed. 
A basic understanding of the calling convention is necessary since it's used extensively in the sample 
code of subsequent chapters. As a reminder, if you're reading this book to learn x86-64 assembly language 
programming and plan on using it with a different operating system or high-level language, you should 
consult the appropriate documentation for information regarding the particulars of that calling convention. 
The Visual C++ calling convention designates each x86-64 CPU general-purpose register as volatile or 
non-volatile. It also applies a volatile or non-volatile classification to each XMM register. An x86-64 assembly 
language function can modify the contents of any volatile register, but must preserve the contents of any 
non-volatile register it uses. Table 5-3 lists the volatile and non-volatile general-purpose and XMM registers. 
On systems that support AVX or AVX2, the high-order 128 bits of each YMM register are classified as 
volatile. Similarly, the high-order 384 bits of registers ZMM0-ZMM15 are classified as volatile on systems
that support AVX-512. Registers ZMM16-ZMM31 and the corresponding YMM and XMM registers are also 
designated as volatile and need not be preserved. 64-bit Visual C++ programs normally don't use the x87 
FPU. Assembly language functions that use this resource are not required to preserve the contents of the x87 
FPU register stack, which means that the entire register stack is classified as volatile. 


Table 5-3. Visual C++ 64-Bit Volatile and Non-Volatile Registers


Register Group     Volatile Registers                 Non-Volatile Registers

General-purpose    RAX, RCX, RDX, R8, R9, R10, R11    RBX, RSI, RDI, RBP, RSP, R12, R13, R14, R15

XMM                XMM0 - XMM5                        XMM6 - XMM15


The programming requirements imposed on an x86-64 assembly language function by the Visual C++ 
calling convention vary depending on whether the function is a leaf or non-leaf function. Leaf functions are 
functions that: 


•  Do not call any other functions.

•  Do not modify the contents of the RSP register.

•  Do not allocate any local stack space.

•  Do not modify any of the non-volatile general-purpose or XMM registers.

•  Do not use exception handling.


64-bit assembly language leaf functions are easier to code, but they’re only suitable for relatively 
simple computational tasks. A non-leaf function can use the entire x86-64 register set, create a stack frame, 
allocate local stack space, or call other functions provided it complies with the calling convention’s precise 
requirements for prologs and epilogs. The sample code of this section exemplifies these requirements.

In the remainder of this section, you'll examine four source code examples. The first three examples
illustrate how to code non-leaf functions using explicit instructions and assembler directives. These
programs also convey critical programming information regarding the organization of a non-leaf function
stack frame. The fourth example demonstrates how to use several prolog and epilog macros. These macros
help automate most of the programming labor that's associated with non-leaf functions.


Basic Stack Frames


Listing 5-9 shows the source code for example Ch05_09. This program demonstrates how to initialize a stack
frame pointer in an assembly language function. Stack frame pointers are used to reference argument values
and local variables on the stack. Example Ch05_09 also illustrates some of the programming protocols that
an assembly language function prolog and epilog must observe.


Listing 5-9. Example Ch05_09


#include "stdafx.h"
#include <iostream>
#include <cstdint>

using namespace std;

extern "C" int64_t Cc1_(int8_t a, int16_t b, int32_t c, int64_t d, int8_t e, int16_t f,
    int32_t g, int64_t h);

int main()
{
    int8_t a = 10, e = -20;
    int16_t b = -200, f = 400;
    int32_t c = 300, g = -600;
    int64_t d = 4000, h = -8000;

    int64_t sum = Cc1_(a, b, c, d, e, f, g, h);

    const char nl = '\n';

    cout << "Results for Cc1\n\n";
    cout << "a = " << (int)a << nl;
    cout << "b = " << b << nl;
    cout << "c = " << c << nl;
    cout << "d = " << d << nl;
    cout << "e = " << (int)e << nl;
    cout << "f = " << f << nl;
    cout << "g = " << g << nl;
    cout << "h = " << h << nl;
    cout << "sum = " << sum << nl;

    return 0;
}

;-------------------------------------------------
;               Ch05_09.asm
;-------------------------------------------------

; extern "C" int64_t Cc1_(int8_t a, int16_t b, int32_t c, int64_t d, int8_t e, int16_t f,
;   int32_t g, int64_t h);

        .code
Cc1_    proc frame

; Function prolog
        push rbp                                ;save caller's rbp register
        .pushreg rbp

        sub rsp,16                              ;allocate local stack space
        .allocstack 16

        mov rbp,rsp                             ;set frame pointer
        .setframe rbp,0

RBP_RA = 24                                     ;offset from rbp to return addr
        .endprolog                              ;mark end of prolog

; Save argument registers to home area (optional)
        mov [rbp+RBP_RA+8],rcx
        mov [rbp+RBP_RA+16],rdx
        mov [rbp+RBP_RA+24],r8
        mov [rbp+RBP_RA+32],r9

; Sum the argument values a, b, c, and d
        movsx rcx,cl                            ;rcx = a
        movsx rdx,dx                            ;rdx = b
        movsxd r8,r8d                           ;r8 = c
        add rcx,rdx                             ;rcx = a + b
        add r8,r9                               ;r8 = c + d
        add r8,rcx                              ;r8 = a + b + c + d
        mov [rbp],r8                            ;save a + b + c + d

; Sum the argument values e, f, g, and h
        movsx rcx,byte ptr [rbp+RBP_RA+40]      ;rcx = e
        movsx rdx,word ptr [rbp+RBP_RA+48]      ;rdx = f
        movsxd r8,dword ptr [rbp+RBP_RA+56]     ;r8 = g
        add rcx,rdx                             ;rcx = e + f
        add r8,qword ptr [rbp+RBP_RA+64]        ;r8 = g + h
        add r8,rcx                              ;r8 = e + f + g + h

; Compute the final sum
        mov rax,[rbp]                           ;rax = a + b + c + d
        add rax,r8                              ;rax = final sum

; Function epilog
        add rsp,16                              ;release local stack space
        pop rbp                                 ;restore caller's rbp register
        ret

Cc1_    endp
        end


The purpose of the C++ code in Listing 5-9 is to initialize a test case for the assembly language function
Cc1_. This function calculates and returns the sum of its eight signed-integer argument values. The results
are then displayed using a series of stream writes to cout.

In the assembly language code, the Cc1_ proc frame statement marks the beginning of function Cc1_.
The frame attribute notifies the assembler that the function Cc1_ uses a stack frame pointer. It also instructs
the assembler to generate static table data that the Visual C++ runtime environment uses to process
exceptions. The ensuing push rbp instruction saves the caller's RBP register on the stack since function
Cc1_ uses this register as its stack frame pointer. The .pushreg rbp statement that follows is an assembler
directive that saves offset information about the push rbp instruction in the exception handling tables. Keep
in mind that assembler directives are not executable instructions; they are directions to the assembler on
how to perform specific actions during assembly of the source code.

A sub rsp,16 instruction allocates 16 bytes of stack space for local variables. The function Cc1_ only 
uses eight bytes of this space, but the Visual C++ calling convention requires non-leaf functions to maintain 
16-byte alignment of the stack pointer outside of the prolog. You'll learn more about stack pointer alignment 
requirements later in this section. The next statement, .allocstack 16, is an assembler directive that saves 
local stack size allocation information in the runtime exception handling tables. 

The mov rbp,rsp instruction initializes register RBP as the stack frame pointer, and the .setframe
rbp,0 directive notifies the assembler of this action. The offset value 0 that's included in the .setframe
directive is the difference in bytes between RSP and RBP. In function Cc1_, registers RSP and RBP are the
same so the offset value is zero. Later in this section, you learn more about the .setframe directive. It should
be noted that assembly language functions can use any non-volatile register as a stack frame pointer. Using
RBP provides consistency between x86-64 and legacy x86 assembly language code. The final assembler
directive, .endprolog, signifies the end of the prolog for function Cc1_. Figure 5-3 shows the stack layout and
argument registers following completion of the prolog.


Figure 5-3. Stack layout and argument registers of function Cc1_ following completion of prolog


The RBP_RA = 24 statement is a directive similar to an equate that assigns the value 24 to the symbol
named RBP_RA. This represents the extra offset bytes (compared to a standard leaf function) needed to
correctly reference the home area of Cc1_, as shown in Figure 5-3. The next block of instructions saves
registers RCX, RDX, R8, and R9 to their respective home areas on the stack. This step is optional and
included in Cc1_ for illustrative purposes. Note that the offset of each mov instruction includes the symbolic
constant RBP_RA. Another option allowed by the Visual C++ calling convention is to save an argument
register to its corresponding home area prior to the push rbp instruction using RSP as a base register (e.g.,
mov [rsp+8],rcx, mov [rsp+16],rdx, and so on). Also keep in mind that a function can use its home area
to store other temporary values. When used for alternative storage purposes, the home area should not be
referenced by an assembly language instruction until after the .endprolog directive.
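
The offset arithmetic behind RBP_RA can be checked with a few compile-time constants. The following C++ sketch is illustrative only; the names kLocalBytes, kSavedRbp, kRbpRa, and ArgOffsetFromRbp are hypothetical and simply mirror the prolog of Cc1_ (a 16-byte local area plus the 8-byte saved RBP).

```cpp
// Values taken from the Cc1_ prolog: sub rsp,16 and push rbp.
constexpr int kLocalBytes = 16;                     // local stack space
constexpr int kSavedRbp   = 8;                      // caller's RBP on the stack
constexpr int kRbpRa      = kLocalBytes + kSavedRbp; // same value as RBP_RA

// Hypothetical helper: offset from RBP of the stack slot that belongs to the
// argument at the given zero-based position. Positions 0-3 are the home-area
// slots for RCX, RDX, R8, and R9; positions 4 and up are the stack arguments.
constexpr int ArgOffsetFromRbp(int position)
{
    return kRbpRa + 8 * (position + 1);             // +8 skips the return address
}

static_assert(kRbpRa == 24, "matches RBP_RA = 24");
static_assert(ArgOffsetFromRbp(0) == kRbpRa + 8, "RCX home area");
static_assert(ArgOffsetFromRbp(4) == kRbpRa + 40, "argument e");
```

With these values, the fifth argument e is located at [rbp+RBP_RA+40] and the eighth argument h at [rbp+RBP_RA+64], which are the displacements used in the listing.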

Following the home area save operation, the function Cc1_ sums argument values a, b, c, and d. It
then saves this intermediate sum to LocalVar1 on the stack using a mov [rbp],r8 instruction. Note that
the summation calculation sign-extends argument values a, b, and c using a movsx or movsxd instruction.

A similar sequence of instructions is used to sum argument values e, f, g, and h, which are located on the
stack and referenced using the stack frame pointer RBP and a constant offset. The symbolic constant RBP_RA
is also used here to account for the extra stack space needed to reference argument values on the stack. The
two intermediate sums are then added to produce the final result in register RAX.
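
For comparison, the same calculation can be written in C++. This is a sketch under the assumption that a C++ counterpart is useful for checking results; the function name Cc1Ref is hypothetical. The explicit conversions to int64_t play the role of the movsx and movsxd instructions.

```cpp
#include <cstdint>

using std::int8_t;
using std::int16_t;
using std::int32_t;
using std::int64_t;

// Hypothetical C++ counterpart of Cc1_. Each narrow argument is widened to
// int64_t before it is added, which corresponds to the sign extension
// performed in the assembly language code.
constexpr int64_t Cc1Ref(int8_t a, int16_t b, int32_t c, int64_t d,
                         int8_t e, int16_t f, int32_t g, int64_t h)
{
    return (int64_t(a) + int64_t(b) + int64_t(c) + d)    // register arguments
         + (int64_t(e) + int64_t(f) + int64_t(g) + h);   // stack arguments
}
```

Using the test values from main (a = 10, b = -200, c = 300, d = 4000, e = -20, f = 400, g = -600, h = -8000), the two intermediate sums are 4110 and -8220, so the final result is -4110.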

A function epilog must release any local stack storage space that was allocated in the prolog, restore
any non-volatile registers that were saved on the stack, and execute a function return. The add rsp,16
instruction releases the 16 bytes of stack space that Cc1_ allocated in its prolog. This is followed by a pop rbp
instruction, which restores the caller's RBP register. The obligatory ret instruction is next. Here is the output
for example Ch05_09:


Results for Cc1

a = 10
b = -200
c = 300
d = 4000
e = -20
f = 400
g = -600
h = -8000
sum = -4110


Using Non-Volatile General-Purpose Registers 


The next sample program is named Ch05_10 and demonstrates how to use the non-volatile general-purpose
registers in a 64-bit assembly language function. It also provides additional programming details regarding
stack frames and the use of local variables. Listing 5-10 shows the C++ and assembly language source code
for sample program Ch05_10.


Listing 5-10. Example Ch05_10


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <cstdint>

using namespace std;

extern "C" bool Cc2_(const int64_t* a, const int64_t* b, int32_t n, int64_t* sum_a,
    int64_t* sum_b, int64_t* prod_a, int64_t* prod_b);

int main()
{
    const int n = 6;
    int64_t a[n] = { 2, -2, -6, 7, 12, 5 };
    int64_t b[n] = { 3, 5, -7, 8, 4, 9 };
    int64_t sum_a, sum_b;
    int64_t prod_a, prod_b;

    bool rc = Cc2_(a, b, n, &sum_a, &sum_b, &prod_a, &prod_b);

    cout << "Results for Cc2\n\n";

    if (rc)
    {
        const int w = 6;
        const char nl = '\n';
        const char* ws = "   ";

        for (int i = 0; i < n; i++)
        {
            cout << "i: " << setw(w) << i << ws;
            cout << "a: " << setw(w) << a[i] << ws;
            cout << "b: " << setw(w) << b[i] << nl;
        }

        cout << nl;
        cout << "sum_a = " << setw(w) << sum_a << ws;
        cout << "sum_b = " << setw(w) << sum_b << nl;
        cout << "prod_a = " << setw(w) << prod_a << ws;
        cout << "prod_b = " << setw(w) << prod_b << nl;
    }
    else
        cout << "Invalid return code\n";

    return 0;
}

;-------------------------------------------------
;               Ch05_10.asm
;-------------------------------------------------

; extern "C" bool Cc2_(const int64_t* a, const int64_t* b, int32_t n, int64_t* sum_a,
;   int64_t* sum_b, int64_t* prod_a, int64_t* prod_b)

; Named expressions for constant values:
;
; NUM_PUSHREG   = number of prolog non-volatile register pushes
; STK_LOCAL1    = size in bytes of STK_LOCAL1 area (see figure in text)
; STK_LOCAL2    = size in bytes of STK_LOCAL2 area (see figure in text)
; STK_PAD       = extra bytes (0 or 8) needed to 16-byte align RSP
; STK_TOTAL     = total size in bytes of local stack
; RBP_RA        = number of bytes between RBP and ret addr on stack

NUM_PUSHREG     = 4
STK_LOCAL1      = 32
STK_LOCAL2      = 16
STK_PAD         = ((NUM_PUSHREG AND 1) XOR 1) * 8
STK_TOTAL       = STK_LOCAL1 + STK_LOCAL2 + STK_PAD
RBP_RA          = NUM_PUSHREG * 8 + STK_LOCAL1 + STK_PAD

        .const
TestVal db 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

        .code
Cc2_    proc frame

; Save non-volatile GP registers on the stack
        push rbp
        .pushreg rbp
        push rbx
        .pushreg rbx
        push r12
        .pushreg r12
        push r13
        .pushreg r13

; Allocate local stack space and set frame pointer
        sub rsp,STK_TOTAL                       ;allocate local stack space
        .allocstack STK_TOTAL
        lea rbp,[rsp+STK_LOCAL2]                ;set frame pointer
        .setframe rbp,STK_LOCAL2
        .endprolog                              ;end of prolog

; Initialize local variables on the stack (demonstration only)
        vmovdqu xmm5,xmmword ptr [TestVal]
        vmovdqa xmmword ptr [rbp-16],xmm5       ;save xmm5 to LocalVar2A/2B
        mov qword ptr [rbp],0aah                ;save 0xaa to LocalVar1A
        mov qword ptr [rbp+8],0bbh              ;save 0xbb to LocalVar1B
        mov qword ptr [rbp+16],0cch             ;save 0xcc to LocalVar1C
        mov qword ptr [rbp+24],0ddh             ;save 0xdd to LocalVar1D

; Save argument values to home area (optional)
        mov qword ptr [rbp+RBP_RA+8],rcx
        mov qword ptr [rbp+RBP_RA+16],rdx
        mov qword ptr [rbp+RBP_RA+24],r8
        mov qword ptr [rbp+RBP_RA+32],r9

; Perform required initializations for processing loop
        test r8d,r8d                            ;is n <= 0?
        jle Error                               ;jump if n <= 0

        xor rbx,rbx                             ;rbx = current element offset
        xor r10,r10                             ;r10 = sum_a
        xor r11,r11                             ;r11 = sum_b
        mov r12,1                               ;r12 = prod_a
        mov r13,1                               ;r13 = prod_b

; Compute the array sums and products
@@:     mov rax,[rcx+rbx]                       ;rax = a[i]
        add r10,rax                             ;update sum_a
        imul r12,rax                            ;update prod_a
        mov rax,[rdx+rbx]                       ;rax = b[i]
        add r11,rax                             ;update sum_b
        imul r13,rax                            ;update prod_b

        add rbx,8                               ;set rbx to next element
        dec r8d                                 ;adjust count
        jnz @B                                  ;repeat until done

; Save the final results
        mov [r9],r10                            ;save sum_a
        mov rax,[rbp+RBP_RA+40]                 ;rax = ptr to sum_b
        mov [rax],r11                           ;save sum_b
        mov rax,[rbp+RBP_RA+48]                 ;rax = ptr to prod_a
        mov [rax],r12                           ;save prod_a
        mov rax,[rbp+RBP_RA+56]                 ;rax = ptr to prod_b
        mov [rax],r13                           ;save prod_b
        mov eax,1                               ;set return code to true

; Function epilog
Done:   lea rsp,[rbp+STK_LOCAL1+STK_PAD]        ;restore rsp
        pop r13                                 ;restore non-volatile GP registers
        pop r12
        pop rbx
        pop rbp
        ret

Error:  xor eax,eax                             ;set return code to false
        jmp Done

Cc2_    endp
        end


Similar to the previous example of this section, the purpose of the C++ code in Listing 5-10 is to prepare
a simple test case in order to exercise the assembly language function Cc2_. In this example, the function
Cc2_ calculates the sums and products of two 64-bit signed integer arrays. The results are then streamed to
cout.

Toward the top of the assembly language code is a series of named constants that control how much
stack space is allocated in the prolog of function Cc2_. Like the previous example, the function Cc2_ includes
the frame attribute as part of its proc statement to indicate that it uses a stack frame pointer. A series of push
instructions saves non-volatile registers RBP, RBX, R12, and R13 on the stack. Note that a .pushreg directive
is used following each push instruction, which instructs the assembler to add information about each push
instruction to the Visual C++ runtime exception handling tables.

A sub rsp,STK_TOTAL instruction allocates space on the stack for local variables, and the required
.allocstack STK_TOTAL directive follows next. Register RBP is then initialized as the function's stack frame
pointer using an lea rbp,[rsp+STK_LOCAL2] instruction, which sets RBP equal to rsp + STK_LOCAL2.
Figure 5-4 illustrates the layout of the stack following execution of the lea instruction. Positioning RBP
so that it "splits" the local stack area into two sections enables the assembler to generate machine code
that's slightly more efficient since a larger portion of the local stack area can be referenced using signed
8-bit instead of signed 32-bit displacements. It also simplifies saving and restoring the non-volatile XMM
registers, which is discussed later in this section. Following the lea instruction is a .setframe
rbp,STK_LOCAL2 directive that enables the assembler to properly configure the runtime exception handling
tables. Note that the size parameter of this directive must be an even multiple of 16 and less than or equal
to 240. The .endprolog directive signifies the end of the prolog for function Cc2_.


Figure 5-4. Stack layout and argument registers following execution of the lea rbp,[rsp+STK_LOCAL2]
instruction in function Cc2_
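
The named expressions that control the stack layout can be mirrored in C++ to check the arithmetic. The sketch below is an illustration only; the FrameLayout structure and its member names are hypothetical. It assumes that STK_LOCAL1 and STK_LOCAL2 are multiples of 16 and that RSP is 8 bytes past a 16-byte boundary on entry, since the call instruction has just pushed the return address.

```cpp
// Hypothetical C++ mirror of the named expressions in the listing.
struct FrameLayout
{
    int numPushReg;     // NUM_PUSHREG
    int stkLocal1;      // STK_LOCAL1 (assumed to be a multiple of 16)
    int stkLocal2;      // STK_LOCAL2 (assumed to be a multiple of 16)

    // STK_PAD: 8 when the number of pushes is even, otherwise 0
    constexpr int StkPad() const   { return ((numPushReg & 1) ^ 1) * 8; }

    // STK_TOTAL: number of bytes subtracted from RSP in the prolog
    constexpr int StkTotal() const { return stkLocal1 + stkLocal2 + StkPad(); }

    // RBP_RA: number of bytes between RBP and the return address
    constexpr int RbpRa() const    { return numPushReg * 8 + stkLocal1 + StkPad(); }

    // True when the return address, the register pushes, and the local
    // allocation together occupy a multiple of 16 bytes, which leaves RSP
    // on a 16-byte boundary after the prolog.
    constexpr bool IsRspAligned() const
    {
        return (8 + numPushReg * 8 + StkTotal()) % 16 == 0;
    }
};

constexpr FrameLayout kCc2 = { 4, 32, 16 };   // values used by function Cc2_

static_assert(kCc2.StkPad() == 8, "even number of pushes needs 8 pad bytes");
static_assert(kCc2.IsRspAligned(), "RSP is 16-byte aligned after the prolog");
```

For Cc2_ this gives STK_PAD = 8, STK_TOTAL = 56, and RBP_RA = 72. The epilog instruction lea rsp,[rbp+STK_LOCAL1+STK_PAD] undoes the allocation because RBP equals RSP + STK_LOCAL2, so RBP + STK_LOCAL1 + STK_PAD equals RSP + STK_TOTAL.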


The next code block contains instructions that initialize the local variables on the stack. These 
instructions are for demonstration purposes only. Note that this block uses a vmovdqa [rbp-16],xmm5 
instruction (Move Aligned Packed Integer Values), which requires its destination operand to be aligned on 
a 16-byte boundary. This instruction embodies the calling convention's mandatory alignment of the RSP 
register to a 16-byte boundary. Following initialization of the local variables, the argument registers are 
saved to their home locations, also merely for demonstration purposes. 

The logic of the main processing loop is straightforward. Following validation of argument value n, the
function Cc2_ initializes the intermediate values sum_a (R10) and sum_b (R11) to 0, and prod_a (R12) and
prod_b (R13) to 1. It then calculates the sum and product of the input arrays a and b. The final results are
saved to the memory locations specified by the caller. Note that the pointers for sum_b, prod_a, and prod_b
were passed to Cc2_ using the stack.
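
The processing loop can also be expressed in C++ for comparison. The function below is a hypothetical counterpart named Cc2Ref that was written for this illustration; it is not part of the example source code. The comments note which register holds each intermediate value in the assembly language version.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical C++ counterpart of Cc2_; parameter names follow the listing.
bool Cc2Ref(const std::int64_t* a, const std::int64_t* b, std::int32_t n,
            std::int64_t* sum_a, std::int64_t* sum_b,
            std::int64_t* prod_a, std::int64_t* prod_b)
{
    assert(a && b && sum_a && sum_b && prod_a && prod_b);

    if (n <= 0)
        return false;                   // corresponds to the jump to Error

    std::int64_t sa = 0, sb = 0;        // R10 and R11 in Cc2_
    std::int64_t pa = 1, pb = 1;        // R12 and R13 in Cc2_

    for (std::int32_t i = 0; i < n; i++)
    {
        sa += a[i];                     // update sum_a
        pa *= a[i];                     // update prod_a
        sb += b[i];                     // update sum_b
        pb *= b[i];                     // update prod_b
    }

    *sum_a = sa;
    *sum_b = sb;
    *prod_a = pa;
    *prod_b = pb;
    return true;
}
```

With the test arrays from main, this sketch yields sum_a = 18, sum_b = 22, prod_a = 10080, and prod_b = -30240.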

The epilog of function Cc2_ begins with a lea rsp,[rbp+STK_LOCAL1+STK_PAD] instruction that restores
register RSP to the value it had just after the push r13 instruction in the prolog. When restoring RSP in an
epilog, the Visual C++ calling convention specifies that either a lea rsp,[RFP+X] or add rsp,X instruction
must be used, where RFP denotes the frame pointer register and X is a constant value. This limits the number
of instruction patterns that the runtime exception handler must identify. The subsequent pop instructions
restore the non-volatile general-purpose registers prior to execution of the ret instruction. According to the
Visual C++ calling convention, function epilogs must be void of any processing logic including the setting
of a return value since this simplifies the amount of processing that's needed within the Visual C++ runtime
exception handler. You'll learn more about the requirements for function epilogs later in this chapter. The
output for example Ch05_10 is the following:


Results for Cc2

i:      0   a:      2   b:      3
i:      1   a:     -2   b:      5
i:      2   a:     -6   b:     -7
i:      3   a:      7   b:      8
i:      4   a:     12   b:      4
i:      5   a:      5   b:      9

sum_a =     18   sum_b =     22
prod_a =  10080   prod_b = -30240


Using Non-Volatile XMM Registers 


Earlier in this chapter, you learned how to use the volatile XMM registers to perform scalar floating-point
arithmetic. The next source code example, Ch05_11, illustrates the prolog and epilog conventions that
must be observed in order to use the non-volatile XMM registers. Listing 5-11 shows the C++ and assembly
language source code for example Ch05_11.


Listing 5-11. Example Ch05_11


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#define _USE_MATH_DEFINES
#include <math.h>

using namespace std;

extern "C" bool Cc3_(const double* r, const double* h, int n, double* sa_cone, double* vol_cone);

int main()
{
    const int n = 7;
    double r[n] = { 1, 1, 2, 2, 3, 3, 4.25 };
    double h[n] = { 1, 2, 3, 4, 5, 10, 12.5 };
    double sa_cone1[n], sa_cone2[n];
    double vol_cone1[n], vol_cone2[n];

    // Calculate surface area and volume of right-circular cones
    for (int i = 0; i < n; i++)
    {
        sa_cone1[i] = M_PI * r[i] * (r[i] + sqrt(r[i] * r[i] + h[i] * h[i]));
        vol_cone1[i] = M_PI * r[i] * r[i] * h[i] / 3.0;
    }

    Cc3_(r, h, n, sa_cone2, vol_cone2);

    cout << fixed;
    cout << "Results for Cc3\n\n";

    const int w = 14;
    const char nl = '\n';
    const char sp = ' ';

    for (int i = 0; i < n; i++)
    {
        cout << setprecision(2);
        cout << "r/h: " << setw(w) << r[i] << sp;
        cout << setw(w) << h[i] << nl;
        cout << setprecision(6);
        cout << "sa:  " << setw(w) << sa_cone1[i] << sp;
        cout << setw(w) << sa_cone2[i] << nl;
        cout << "vol: " << setw(w) << vol_cone1[i] << sp;
        cout << setw(w) << vol_cone2[i] << nl;
        cout << nl;
    }

    return 0;
}

;-------------------------------------------------
;               Ch05_11.asm
;-------------------------------------------------

; extern "C" bool Cc3_(const double* r, const double* h, int n, double* sa_cone, double*
;   vol_cone)

; Named expressions for constant values
;
; NUM_PUSHREG   = number of prolog non-volatile register pushes
; STK_LOCAL1    = size in bytes of STK_LOCAL1 area (see figure in text)
; STK_LOCAL2    = size in bytes of STK_LOCAL2 area (see figure in text)
; STK_PAD       = extra bytes (0 or 8) needed to 16-byte align RSP
; STK_TOTAL     = total size in bytes of local stack
; RBP_RA        = number of bytes between RBP and ret addr on stack

NUM_PUSHREG     = 7
STK_LOCAL1      = 16
STK_LOCAL2      = 64
STK_PAD         = ((NUM_PUSHREG AND 1) XOR 1) * 8
STK_TOTAL       = STK_LOCAL1 + STK_LOCAL2 + STK_PAD
RBP_RA          = NUM_PUSHREG * 8 + STK_LOCAL1 + STK_PAD

        .const
r8_3p0  real8 3.0
r8_pi   real8 3.14159265358979323846

        .code
Cc3_    proc frame

; Save non-volatile registers on the stack.
        push rbp
        .pushreg rbp
        push rbx
        .pushreg rbx
        push rsi
        .pushreg rsi
        push r12
        .pushreg r12
        push r13
        .pushreg r13
        push r14
        .pushreg r14
        push r15
        .pushreg r15

; Allocate local stack space and initialize frame pointer
        sub rsp,STK_TOTAL                       ;allocate local stack space
        .allocstack STK_TOTAL
        lea rbp,[rsp+STK_LOCAL2]                ;rbp = stack frame pointer
        .setframe rbp,STK_LOCAL2

; Save non-volatile registers XMM12 - XMM15. Note that STK_LOCAL2 must
; be greater than or equal to the number of XMM register saves times 16.
        vmovdqa xmmword ptr [rbp-STK_LOCAL2+48],xmm12
        .savexmm128 xmm12,48
        vmovdqa xmmword ptr [rbp-STK_LOCAL2+32],xmm13
        .savexmm128 xmm13,32
        vmovdqa xmmword ptr [rbp-STK_LOCAL2+16],xmm14
        .savexmm128 xmm14,16
        vmovdqa xmmword ptr [rbp-STK_LOCAL2],xmm15
        .savexmm128 xmm15,0
        .endprolog

; Access local variables on the stack (demonstration only)
        mov qword ptr [rbp],-1                  ;LocalVar1A = -1
        mov qword ptr [rbp+8],-2                ;LocalVar1B = -2

; Initialize the processing loop variables. Note that many of the
; register initializations below are performed merely to illustrate
; use of the non-volatile GP and XMM registers.
        mov esi,r8d                             ;esi = n
        test esi,esi                            ;is n > 0?
        jg @F                                   ;jump if n > 0

        xor eax,eax                             ;set error return code
        jmp Done

@@:     xor rbx,rbx                             ;rbx = array element offset
        mov r12,rcx                             ;r12 = ptr to r
        mov r13,rdx                             ;r13 = ptr to h
        mov r14,r9                              ;r14 = ptr to sa_cone
        mov r15,[rbp+RBP_RA+40]                 ;r15 = ptr to vol_cone
        vmovsd xmm14,real8 ptr [r8_pi]          ;xmm14 = pi
        vmovsd xmm15,real8 ptr [r8_3p0]         ;xmm15 = 3.0

; Calculate cone surface areas and volumes
; sa = pi * r * (r + sqrt(r * r + h * h))
; vol = pi * r * r * h / 3

@@:     vmovsd xmm0,real8 ptr [r12+rbx]         ;xmm0 = r
        vmovsd xmm1,real8 ptr [r13+rbx]         ;xmm1 = h
        vmovsd xmm12,xmm12,xmm0                 ;xmm12 = r
        vmovsd xmm13,xmm13,xmm1                 ;xmm13 = h

        vmulsd xmm0,xmm0,xmm0                   ;xmm0 = r * r
        vmulsd xmm1,xmm1,xmm1                   ;xmm1 = h * h
        vaddsd xmm0,xmm0,xmm1                   ;xmm0 = r * r + h * h

        vsqrtsd xmm0,xmm0,xmm0                  ;xmm0 = sqrt(r * r + h * h)
        vaddsd xmm0,xmm0,xmm12                  ;xmm0 = r + sqrt(r * r + h * h)
        vmulsd xmm0,xmm0,xmm12                  ;xmm0 = r * (r + sqrt(r * r + h * h))
        vmulsd xmm0,xmm0,xmm14                  ;xmm0 = pi * r * (r + sqrt(r * r + h * h))

        vmulsd xmm12,xmm12,xmm12                ;xmm12 = r * r
        vmulsd xmm13,xmm13,xmm14                ;xmm13 = h * pi
        vmulsd xmm13,xmm13,xmm12                ;xmm13 = pi * r * r * h
        vdivsd xmm13,xmm13,xmm15                ;xmm13 = pi * r * r * h / 3

        vmovsd real8 ptr [r14+rbx],xmm0         ;save surface area
        vmovsd real8 ptr [r15+rbx],xmm13        ;save volume

        add rbx,8                               ;set rbx to next element
        dec esi                                 ;update counter
        jnz @B                                  ;repeat until done

        mov eax,1                               ;set success return code

; Restore non-volatile XMM registers
Done:   vmovdqa xmm12,xmmword ptr [rbp-STK_LOCAL2+48]
        vmovdqa xmm13,xmmword ptr [rbp-STK_LOCAL2+32]
        vmovdqa xmm14,xmmword ptr [rbp-STK_LOCAL2+16]
        vmovdqa xmm15,xmmword ptr [rbp-STK_LOCAL2]

; Function epilog
        lea rsp,[rbp+STK_LOCAL1+STK_PAD]        ;restore rsp
        pop r15                                 ;restore NV GP registers
        pop r14
        pop r13
        pop r12
        pop rsi
        pop rbx
        pop rbp
        ret

Cc3_    endp
        end


The C++ code of example Ch05_11 calculates the surface area and volume of right-circular cones.
It also exercises an assembly language function named Cc3_, which performs the same
surface area and volume calculations. The following formulas are used to calculate a cone's surface area and
volume:


$$sa = \pi r \left(r + \sqrt{r^2 + h^2}\right)$$

$$vol = \pi r^2 h / 3$$
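The two formulas can be checked independently of the assembly code with a few lines of C++. The following is only a minimal sketch; the function names ConeSurfaceArea and ConeVolume are hypothetical and do not appear in example Ch05_11.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helpers (not part of example Ch05_11).

// Surface area of a right-circular cone: pi * r * (r + sqrt(r^2 + h^2)).
double ConeSurfaceArea(double r, double h)
{
    assert(r >= 0.0 && h >= 0.0);   // assumed precondition for this sketch
    const double pi = 3.14159265358979323846;
    return pi * r * (r + std::sqrt(r * r + h * h));
}

// Volume of a right-circular cone: pi * r^2 * h / 3.
double ConeVolume(double r, double h)
{
    assert(r >= 0.0 && h >= 0.0);   // assumed precondition for this sketch
    const double pi = 3.14159265358979323846;
    return pi * r * r * h / 3.0;
}
```

The assembly language loop evaluates the same expressions in the same order, keeping r and h in XMM12 and XMM13 so that both results can be derived from a single pair of loads.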


The function Cc3_ begins by saving the non-volatile general-purpose registers that it uses on the stack.
It then allocates the specified amount of local stack space and initializes RBP as the stack frame pointer.
The next code block saves non-volatile registers XMM12-XMM15 on the stack using a series of vmovdqa
instructions. A .savexmm128 directive must be used after each vmovdqa instruction. Like the other prolog
directives, the .savexmm128 directive instructs the assembler to store information regarding an XMM register
save operation in its exception handling tables. The offset argument of a .savexmm128 directive represents
the displacement of the saved XMM register on the stack relative to the RSP register. Note that the size of
STK_LOCAL2 must be greater than or equal to the number of saved XMM registers multiplied by 16. Figure 5-5
illustrates the layout of the stack following execution of the vmovdqa xmmword ptr [rbp-STK_LOCAL2],xmm15
instruction.


Figure 5-5. Stack layout and argument registers following execution of the vmovdqa xmmword ptr
[rbp-STK_LOCAL2],xmm15 instruction in function Cc3_
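The arithmetic behind the named expressions at the top of the Cc3_ listing can be sketched in C++ as follows. The constant names mirror the listing; the helper BytesBelowCallerRsp is hypothetical and assumes the usual x64 rule that RSP is 16-byte aligned immediately before the call instruction, so the return address leaves it misaligned by 8 bytes on entry.

```cpp
#include <cassert>

// Same values and expressions as the Cc3_ listing.
constexpr int NUM_PUSHREG = 7;      // RBP, RBX, RSI, R12, R13, R14, R15
constexpr int STK_LOCAL1 = 16;
constexpr int STK_LOCAL2 = 64;
constexpr int STK_PAD = ((NUM_PUSHREG & 1) ^ 1) * 8;
constexpr int STK_TOTAL = STK_LOCAL1 + STK_LOCAL2 + STK_PAD;
constexpr int RBP_RA = NUM_PUSHREG * 8 + STK_LOCAL1 + STK_PAD;

// Hypothetical helper: total bytes between the caller's aligned RSP and
// the callee's RSP after the prolog (return address + pushes + locals).
constexpr int BytesBelowCallerRsp(int numPushReg, int stkLocal1, int stkLocal2)
{
    return 8 + numPushReg * 8 + stkLocal1 + stkLocal2
             + ((numPushReg & 1) ^ 1) * 8;
}

// An odd number of pushes needs no pad; an even number needs 8 bytes.
static_assert(BytesBelowCallerRsp(7, 16, 64) % 16 == 0, "RSP must be aligned");
static_assert(BytesBelowCallerRsp(6, 16, 64) % 16 == 0, "RSP must be aligned");
```

With these values STK_PAD is 0, STK_TOTAL is 80, and RBP_RA is 72. The fifth argument (vol_cone) therefore sits at RBP + 72 + 40, where 40 covers the 8-byte return address plus the 32-byte home area.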


Following the prolog, local variables LocalVar1A and LocalVar1B are accessed for demonstration
purposes only. Initialization of the registers used by the main processing loop occurs next. Note that many of
these initializations are either suboptimal or superfluous; they are performed merely to elucidate use of the
non-volatile general-purpose and XMM registers. Calculation of the cone surface areas and volumes is
then carried out using AVX double-precision floating-point arithmetic.

Subsequent to the completion of the processing loop, the non-volatile XMM registers are restored
using a series of vmovdqa instructions. The function Cc3_ then releases its local stack space and restores the
previously saved non-volatile general-purpose registers that it used. Here is the output for example Ch05_11.

Results for Cc3

r/h:         1.00         1.00
sa:      7.584476     7.584476
vol:     1.047198     1.047198

r/h:         1.00         2.00
sa:     10.166407    10.166407
vol:     2.094395     2.094395

r/h:         2.00         3.00
sa:     35.220717    35.220717
vol:    12.566371    12.566371

r/h:         2.00         4.00
sa:     40.665630    40.665630
vol:    16.755161    16.755161

r/h:         3.00         5.00
sa:     83.229761    83.229761
vol:    47.123890    47.123890

r/h:         3.00        10.00
sa:    126.671905   126.671905
vol:    94.247780    94.247780

r/h:         4.25        12.50
sa:    233.025028   233.025028
vol:   236.437572   236.437572


Macros for Prologs and Epilogs


The purpose of the previous three source code examples was to elucidate use of the Visual C++ calling
convention for 64-bit non-leaf functions. The calling convention's rigid requirements for function prologs
and epilogs are somewhat lengthy and a potential source of programming errors. It is important to recognize
that the stack layout of a non-leaf function is primarily determined by the number of non-volatile (both
general-purpose and XMM) registers that must be preserved and the amount of local stack storage space
that's needed. A method is needed to automate most of the coding drudgery associated with the calling
convention.

Listing 5-12 shows the C++ and assembly language source code for example Ch05_12. This source code
example demonstrates how to use several macros that I've written to simplify the coding of a prolog and
epilog in a non-leaf function. It also illustrates how to call a C++ library function.


Listing 5-12. Example Ch05_12


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <cmath>

using namespace std;

extern "C" bool Cc4_(const double* ht, const double* wt, int n, double* bsa1, double* bsa2,
    double* bsa3);

int main()
{
    const int n = 6;
    const double ht[n] = { 150, 160, 170, 180, 190, 200 };
    const double wt[n] = { 50.0, 60.0, 70.0, 80.0, 90.0, 100.0 };
    double bsa1_a[n], bsa1_b[n];
    double bsa2_a[n], bsa2_b[n];
    double bsa3_a[n], bsa3_b[n];

    for (int i = 0; i < n; i++)
    {
        bsa1_a[i] = 0.007184 * pow(ht[i], 0.725) * pow(wt[i], 0.425);
        bsa2_a[i] = 0.0235 * pow(ht[i], 0.42246) * pow(wt[i], 0.51456);
        bsa3_a[i] = sqrt(ht[i] * wt[i] / 3600.0);
    }

    Cc4_(ht, wt, n, bsa1_b, bsa2_b, bsa3_b);

    cout << "Results for Cc4_\n\n";
    cout << fixed;

    const char sp = ' ';

    for (int i = 0; i < n; i++)
    {
        cout << setprecision(1);
        cout << "height: " << setw(6) << ht[i] << " cm\n";
        cout << "weight: " << setw(6) << wt[i] << " kg\n";
        cout << setprecision(6);

        cout << "BSA (C++):    ";
        cout << setw(10) << bsa1_a[i] << sp;
        cout << setw(10) << bsa2_a[i] << sp;
        cout << setw(10) << bsa3_a[i] << " (sq. m)\n";

        cout << "BSA (X86-64): ";
        cout << setw(10) << bsa1_b[i] << sp;
        cout << setw(10) << bsa2_b[i] << sp;
        cout << setw(10) << bsa3_b[i] << " (sq. m)\n\n";
    }

    return 0;
}


;-------------------------------------------------
;               Ch05_12.asm
;-------------------------------------------------

; extern "C" bool Cc4_(const double* ht, const double* wt, int n, double* bsa1, double* bsa2, double* bsa3);

        include <MacrosX86-64-AVX.asmh>

                .const
r8_0p007184     real8 0.007184
r8_0p725        real8 0.725
r8_0p425        real8 0.425
r8_0p0235       real8 0.0235
r8_0p42246      real8 0.42246
r8_0p51456      real8 0.51456
r8_3600p0       real8 3600.0

        .code
        extern pow:proc

Cc4_ proc frame
        _CreateFrame Cc4_,16,64,rbx,rsi,r12,r13,r14,r15
        _SaveXmmRegs xmm6,xmm7,xmm8,xmm9
        _EndProlog

; Save argument registers to home area (optional). Note that the home
; area can also be used to store other transient data values.
        mov qword ptr [rbp+Cc4_OffsetHomeRCX],rcx
        mov qword ptr [rbp+Cc4_OffsetHomeRDX],rdx
        mov qword ptr [rbp+Cc4_OffsetHomeR8],r8
        mov qword ptr [rbp+Cc4_OffsetHomeR9],r9

; Initialize processing loop pointers. Note that the pointers are
; maintained in non-volatile registers, which eliminates reloads
; after the calls to pow().
        test r8d,r8d                            ;is n > 0?
        jg @F                                   ;jump if n > 0

        xor eax,eax                             ;set error return code
        jmp Done

@@:     mov [rbp],r8d                           ;save n to local var
        mov r12,rcx                             ;r12 = ptr to ht
        mov r13,rdx                             ;r13 = ptr to wt
        mov r14,r9                              ;r14 = ptr to bsa1
        mov r15,[rbp+Cc4_OffsetStackArgs]       ;r15 = ptr to bsa2
        mov rbx,[rbp+Cc4_OffsetStackArgs+8]     ;rbx = ptr to bsa3
        xor rsi,rsi                             ;array element offset

; Allocate home space on stack for use by pow()
        sub rsp,32

; Calculate bsa1 = 0.007184 * pow(ht, 0.725) * pow(wt, 0.425);
@@:     vmovsd xmm0,real8 ptr [r12+rsi]         ;xmm0 = height
        vmovsd xmm8,xmm8,xmm0
        vmovsd xmm1,real8 ptr [r8_0p725]
        call pow                                ;xmm0 = pow(ht, 0.725)
        vmovsd xmm6,xmm6,xmm0

        vmovsd xmm0,real8 ptr [r13+rsi]         ;xmm0 = weight
        vmovsd xmm9,xmm9,xmm0
        vmovsd xmm1,real8 ptr [r8_0p425]
        call pow                                ;xmm0 = pow(wt, 0.425)
        vmulsd xmm6,xmm6,real8 ptr [r8_0p007184]
        vmulsd xmm6,xmm6,xmm0                   ;xmm6 = bsa1

; Calculate bsa2 = 0.0235 * pow(ht, 0.42246) * pow(wt, 0.51456);
        vmovsd xmm0,xmm0,xmm8                   ;xmm0 = height
        vmovsd xmm1,real8 ptr [r8_0p42246]
        call pow                                ;xmm0 = pow(ht, 0.42246)
        vmovsd xmm7,xmm7,xmm0

        vmovsd xmm0,xmm0,xmm9                   ;xmm0 = weight
        vmovsd xmm1,real8 ptr [r8_0p51456]
        call pow                                ;xmm0 = pow(wt, 0.51456)
        vmulsd xmm7,xmm7,real8 ptr [r8_0p0235]
        vmulsd xmm7,xmm7,xmm0                   ;xmm7 = bsa2

; Calculate bsa3 = sqrt(ht * wt / 3600.0);
        vmulsd xmm8,xmm8,xmm9
        vdivsd xmm8,xmm8,real8 ptr [r8_3600p0]
        vsqrtsd xmm8,xmm8,xmm8                  ;xmm8 = bsa3

; Save BSA results
        vmovsd real8 ptr [r14+rsi],xmm6         ;save bsa1 result
        vmovsd real8 ptr [r15+rsi],xmm7         ;save bsa2 result
        vmovsd real8 ptr [rbx+rsi],xmm8         ;save bsa3 result

        add rsi,8                               ;update array offset
        dec dword ptr [rbp]                     ;n = n - 1
        jnz @B

        mov eax,1                               ;set success return code

Done:   _RestoreXmmRegs xmm6,xmm7,xmm8,xmm9
        _DeleteFrame rbx,rsi,r12,r13,r14,r15
        ret

Cc4_ endp
        end


The purpose of the code in main is to initialize several test cases and exercise the assembly language
function Cc4_. This function computes estimates of human body surface area (BSA) using several
well-known equations. These equations are defined in Table 5-4. In this table, each equation uses the symbol H
for height in centimeters, W for weight in kilograms, and BSA for body surface area in square meters.


Table 5-4. Body Surface Area Equations


Formula               Equation
DuBois and DuBois     $BSA = 0.007184 \times H^{0.725} \times W^{0.425}$
Gehan and George      $BSA = 0.0235 \times H^{0.42246} \times W^{0.51456}$
Mosteller             $BSA = \sqrt{H \times W / 3600}$


The assembly language code for example Ch05_12 begins with an include statement that incorporates
the contents of the file MacrosX86-64-AVX.asmh. This file (source code not shown but included with the
Chapter 5 download package) contains a number of macros that help automate much of the coding grunt
work that's associated with the Visual C++ calling convention. A macro is an assembler text substitution
mechanism that enables a programmer to represent a sequence of assembly language instructions, data
definitions, or other statements using a single text string. Assembly language macros are typically employed
to generate sequences of instructions that will be used more than once. Macros are also frequently used to
avoid the performance overhead of a function call. Source code example Ch05_12 demonstrates the use of
the calling convention macros. You learn how to define your own macros later in this book.

Figure 5-6 shows a generic stack layout diagram for a non-leaf function. Note the similarities between
this figure and the more detailed stack layouts of Figures 5-4 and 5-5. The macros defined in
MacrosX86-64-AVX.asmh assume that a function's basic stack layout will conform to what's shown in Figure 5-6.
They enable a function to tailor its own detailed stack frame by specifying the amount of local stack space
that's needed and which non-volatile registers must be preserved. The macros also perform most of the
required stack offset calculations, which reduces the risk of a programming error in the prolog or epilog.

[Figure 5-6 diagram not reproduced. Labels recoverable from the figure, from high memory to low memory:
Stack Arguments; Saved Non-Volatile General-Purpose Registers; Stack Pad (if necessary); Local1 Variables;
RBP; Local2 Variables (optional); Saved Non-Volatile XMM Registers; RSP. The size markers are StkSizeLocal1,
StkSizeLocal2, and StkSizeTotal. Register Arguments: RCX, RDX, R8, R9 and XMM0, XMM1, XMM2, XMM3.]

Figure 5-6. Generic stack layout for a non-leaf function


Returning to the assembly code, immediately after the include statement is a .const section that
contains definitions for the various floating-point constant values used in the BSA equations. The line
extern pow:proc enables use of the external C++ library function pow. Following the Cc4_ proc frame
statement, the macro _CreateFrame is used to generate the code that initializes the function's stack frame.
It also saves the specified non-volatile general-purpose registers on the stack. The macro requires several
additional parameters, including a prefix string and the size in bytes of StkSizeLocal1 and StkSizeLocal2
(see Figure 5-6). The macro _CreateFrame uses the specified prefix string to create symbolic names that
can be employed to reference items on the stack. It's somewhat convenient to use a shortened version of
the function name as the prefix string but any unique text string can be used. Both StkSizeLocal1 and
StkSizeLocal2 must be evenly divisible by 16. StkSizeLocal2 must also be less than or equal to 240, and
greater than or equal to the number of saved XMM registers multiplied by 16.
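The size constraints just listed are easy to get wrong by hand, so it can help to see them written as a single predicate. The C++ sketch below is illustrative only; IsValidFrameSizes is a hypothetical name and is not part of MacrosX86-64-AVX.asmh, whose source is not shown here.

```cpp
#include <cassert>

// Hypothetical check of the _CreateFrame size rules described in the text.
// numXmmSaved is the number of registers later passed to _SaveXmmRegs.
constexpr bool IsValidFrameSizes(int stkSizeLocal1, int stkSizeLocal2, int numXmmSaved)
{
    return stkSizeLocal1 % 16 == 0          // Local1 evenly divisible by 16
        && stkSizeLocal2 % 16 == 0          // Local2 evenly divisible by 16
        && stkSizeLocal2 <= 240             // upper limit stated in the text
        && stkSizeLocal2 >= numXmmSaved * 16;   // room for each 16-byte XMM save
}

// The values used by Cc4_: Local1 = 16, Local2 = 64, four XMM registers.
static_assert(IsValidFrameSizes(16, 64, 4), "Cc4_ frame sizes should be valid");
```

For example, reducing StkSizeLocal2 to 48 while still saving XMM6-XMM9 fails the last test, since four XMM saves need 64 bytes.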

The next statement uses the _SaveXmmRegs macro to save the specified non-volatile XMM registers to
the XMM save area on the stack. This is followed by the _EndProlog macro, which signifies the end of the
function's prolog. Subsequent to the completion of the prolog, register RBP is configured as the function's
stack frame pointer. It is also safe to use any of the saved non-volatile general-purpose or XMM registers
subsequent to the _EndProlog macro.

The block of instructions that follows _EndProlog saves the argument registers to their home locations
on the stack. Note that each mov instruction includes a symbolic name that equates to the offset of the
register's home area on the stack relative to the RBP register. The symbolic names and the corresponding
offset values were automatically generated by the _CreateFrame macro. The home area can also be used to
store temporary data instead of the argument registers, as mentioned earlier in this chapter.

Initialization of the processing loop variables occurs next. The value n in register R8D is checked for
validity and saved on the stack as a local variable. Several non-volatile registers are then initialized as pointer
registers. Non-volatile registers are used in order to avoid register reloads following each call to the C++
library function pow. Note that the pointer to array bsa2 is loaded from the stack using a
mov r15,[rbp+Cc4_OffsetStackArgs] instruction. The symbolic constant Cc4_OffsetStackArgs also was
automatically generated by the macro _CreateFrame and equates to the offset of the first stack argument
relative to the RBP register. A mov rbx,[rbp+Cc4_OffsetStackArgs+8] instruction loads argument bsa3 into
register RBX; the constant 8 is included as part of the source operand displacement since bsa3 is the second
argument passed via the stack.
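To make the generated offsets concrete, the following C++ sketch computes plausible values for Cc4_ under two stated assumptions: that _CreateFrame lays out the stack the same way the hand-coded Cc3_ prolog does, and that it pushes RBP in addition to the six registers named in its argument list. The struct and function names are hypothetical, and the actual macro source may compute these values differently.

```cpp
#include <cassert>

// Hypothetical illustration; not taken from MacrosX86-64-AVX.asmh.
struct FrameOffsets
{
    int returnAddr;   // RBP-relative offset of the return address
    int homeRcx;      // RBP-relative offset of the RCX home slot
    int stackArgs;    // RBP-relative offset of the first stack argument
};

// numPushReg counts every pushed register, including RBP (assumption).
inline FrameOffsets ComputeFrameOffsets(int numPushReg, int stkSizeLocal1)
{
    const int stkPad = ((numPushReg & 1) ^ 1) * 8;          // 0 or 8 bytes
    const int returnAddr = numPushReg * 8 + stkSizeLocal1 + stkPad;
    const int homeRcx = returnAddr + 8;                     // skip return address
    const int stackArgs = homeRcx + 32;                     // skip 32-byte home area
    return FrameOffsets{ returnAddr, homeRcx, stackArgs };
}
```

Under these assumptions, seven pushes and a 16-byte Local1 area place the return address at RBP+72, the RCX home slot at RBP+80, bsa2 at RBP+112, and bsa3 at RBP+120.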

The Visual C++ calling convention requires the caller of a function to allocate that function's home area
on the stack. The sub rsp,32 instruction performs this operation. The ensuing block of code calculates the
BSA values using the equations shown in Table 5-4. Note that registers XMM0 and XMM1 are loaded with
the necessary argument values prior to each call to pow. Also note that some of the return values from pow are
preserved in non-volatile XMM registers prior to their actual use.

Following completion of the BSA processing loop is the epilog for Cc4_. Before execution of the ret
instruction, the function must restore all non-volatile XMM and general-purpose registers that it saved in
the prolog. The stack frame must also be properly deleted. The _RestoreXmmRegs macro restores the
non-volatile XMM registers. Note that this macro requires the order of the registers in its argument list to match
the register list that was used with the _SaveXmmRegs macro. Stack frame cleanup and general-purpose
register restores are handled by the _DeleteFrame macro. The order of the registers specified in this macro's
argument list must be identical to the prolog's _CreateFrame macro. The _DeleteFrame macro also restores
register RSP from RBP, which means that it's not necessary to include an explicit add rsp,32 instruction to
release the home area allocated on the stack for pow. Here's the output for example Ch05_12.


Results for Cc4_

height:  150.0 cm
weight:   50.0 kg
BSA (C++):      1.432500   1.460836   1.443376 (sq. m)
BSA (X86-64):   1.432500   1.460836   1.443376 (sq. m)

height:  160.0 cm
weight:   60.0 kg
BSA (C++):      1.622063   1.648868   1.632993 (sq. m)
BSA (X86-64):   1.622063   1.648868   1.632993 (sq. m)

height:  170.0 cm
weight:   70.0 kg
BSA (C++):      1.809708   1.831289   1.818119 (sq. m)
BSA (X86-64):   1.809708   1.831289   1.818119 (sq. m)

height:  180.0 cm
weight:   80.0 kg
BSA (C++):      1.996421   2.009483   2.000000 (sq. m)
BSA (X86-64):   1.996421   2.009483   2.000000 (sq. m)

height:  190.0 cm
weight:   90.0 kg
BSA (C++):      2.182809   2.184365   2.179449 (sq. m)
BSA (X86-64):   2.182809   2.184365   2.179449 (sq. m)

height:  200.0 cm
weight:  100.0 kg
BSA (C++):      2.369262   2.356574   2.357023 (sq. m)
BSA (X86-64):   2.369262   2.356574   2.357023 (sq. m)


Summary 


Here are the key learning points for Chapter 5:

The vadds[d|s], vsubs[d|s], vmuls[d|s], vdivs[d|s], and vsqrts[d|s]
instructions perform basic double-precision and single-precision floating-point
arithmetic.


The vmovs[d|s] instructions copy a scalar floating-point value from one XMM 
register to another; they are also used to load/store scalar floating-point values from/ 
to memory. 


The vcomis[d|s] instructions compare two scalar floating-point values and set the
status flags in RFLAGS to signify the result.


The vcmps[d|s] instructions compare two scalar floating-point values using a 
compare predicate. If the compare predicate is true, the destination operand is set to 
all ones; otherwise, it is set to all zeros. 


The vcvts[d|s]2si instructions convert a scalar floating-point value to a signed 
integer value; the vcvtsi2s[d|s] instructions perform the opposite conversion. 


The vcvtsd2ss instruction converts a scalar double-precision floating-point value to 
single-precision; the vcvtss2sd instruction performs the opposite conversion. 


The vldmxcsr instruction loads a value into the MXCSR register; the vstmxcsr 
instruction saves the current contents of the MXCSR register. 


Leaf functions can be used for simple processing tasks and do not require a prolog or 
epilog. A non-leaf function must use a prolog and epilog to save and restore non- 
volatile registers, initialize a stack frame pointer, allocate local storage space on the 
stack, or call other functions. 


CHAPTER 6 


AVX Programming - Packed 
Floating-Point 


The source code examples of the previous chapter elucidated the fundamentals of AVX programming using
scalar floating-point arithmetic. In this chapter, you'll learn how to use the AVX instruction set to perform
operations using packed floating-point operands. The chapter begins with three source code examples that
demonstrate common packed floating-point operations, including basic arithmetic, data comparisons,
and data conversions. The next set of source code examples illustrates how to carry out SIMD computations
using floating-point arrays. The final two source code examples explain how to use the AVX instruction set to
accelerate matrix transposition and multiplication.

In Chapter 4 you learned that AVX supports packed floating-point operations using either 128-bit or 
256-bit wide operands. All of the source code examples in this chapter use 128-bit wide packed floating- 
point operands, both single-precision and double-precision, and the XMM register set. You'll learn how to 
use 256-bit wide packed floating-point operands and the YMM register set in Chapter 9. 


Packed Floating-Point Arithmetic 


Listing 6-1 shows the source code for example Ch06_01, which demonstrates how to perform common
arithmetic operations using packed single-precision and double-precision floating-point operands. It also
highlights proper alignment techniques for packed floating-point operands in memory.


Listing 6-1. Example Ch06_01


#pragma once
#include <string>
#include <cstdint>
#include <sstream>
#include <iomanip>

struct XmmVal
{
public:
    union
    {
        int8_t m_I8[16];
        int16_t m_I16[8];
        int32_t m_I32[4];
        int64_t m_I64[2];
        uint8_t m_U8[16];
        uint16_t m_U16[8];
        uint32_t m_U32[4];
        uint64_t m_U64[2];
        float m_F32[4];
        double m_F64[2];
    };

    // Formatting member functions such as ToStringF32() and ToStringF64() are not shown.
};

#include "stdafx.h"
#include <iostream>
#define _USE_MATH_DEFINES
#include <math.h>
#include "XmmVal.h"

using namespace std;

extern "C" void AvxPackedMathF32_(const XmmVal& a, const XmmVal& b, XmmVal c[8]);
extern "C" void AvxPackedMathF64_(const XmmVal& a, const XmmVal& b, XmmVal c[8]);

void AvxPackedMathF32(void)
{
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;
    alignas(16) XmmVal c[8];

    a.m_F32[0] = 36.0f;                 b.m_F32[0] = -(float)(1.0 / 9.0);
    a.m_F32[1] = (float)(1.0 / 32.0);   b.m_F32[1] = 64.0f;
    a.m_F32[2] = 2.0f;                  b.m_F32[2] = -0.0625f;
    a.m_F32[3] = 42.0f;                 b.m_F32[3] = 8.666667f;

    AvxPackedMathF32_(a, b, c);

    cout << ("\nResults for AvxPackedMathF32\n");
    cout << "a:        " << a.ToStringF32() << '\n';
    cout << "b:        " << b.ToStringF32() << '\n';
    cout << '\n';
    cout << "addps:    " << c[0].ToStringF32() << '\n';
    cout << "subps:    " << c[1].ToStringF32() << '\n';
    cout << "mulps:    " << c[2].ToStringF32() << '\n';
    cout << "divps:    " << c[3].ToStringF32() << '\n';
    cout << "absps b:  " << c[4].ToStringF32() << '\n';
    cout << "sqrtps a: " << c[5].ToStringF32() << '\n';
    cout << "minps:    " << c[6].ToStringF32() << '\n';
    cout << "maxps:    " << c[7].ToStringF32() << '\n';
}

void AvxPackedMathF64(void)
{
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;
    alignas(16) XmmVal c[8];

    a.m_F64[0] = 2.0;       b.m_F64[0] = M_E;
    a.m_F64[1] = M_PI;      b.m_F64[1] = -M_1_PI;

    AvxPackedMathF64_(a, b, c);

    cout << ("\nResults for AvxPackedMathF64\n");
    cout << "a:        " << a.ToStringF64() << '\n';
    cout << "b:        " << b.ToStringF64() << '\n';
    cout << '\n';
    cout << "addpd:    " << c[0].ToStringF64() << '\n';
    cout << "subpd:    " << c[1].ToStringF64() << '\n';
    cout << "mulpd:    " << c[2].ToStringF64() << '\n';
    cout << "divpd:    " << c[3].ToStringF64() << '\n';
    cout << "abspd b:  " << c[4].ToStringF64() << '\n';
    cout << "sqrtpd a: " << c[5].ToStringF64() << '\n';
    cout << "minpd:    " << c[6].ToStringF64() << '\n';
    cout << "maxpd:    " << c[7].ToStringF64() << '\n';
}

int main()
{
    AvxPackedMathF32();
    AvxPackedMathF64();
    return 0;
}

;-------------------------------------------------
;               Ch06_01.asm
;-------------------------------------------------

            .const
            align 16
AbsMaskF32  dword 7fffffffh, 7fffffffh, 7fffffffh, 7fffffffh  ;Absolute value mask for SPFP
AbsMaskF64  qword 7fffffffffffffffh, 7fffffffffffffffh        ;Absolute value mask for DPFP

; extern "C" void AvxPackedMathF32_(const XmmVal& a, const XmmVal& b, XmmVal c[8]);

        .code
AvxPackedMathF32_ proc

; Load packed SPFP values
        vmovaps xmm0,xmmword ptr [rcx]      ;xmm0 = a
        vmovaps xmm1,xmmword ptr [rdx]      ;xmm1 = b

; Packed SPFP addition
        vaddps xmm2,xmm0,xmm1
        vmovaps [r8+0],xmm2

; Packed SPFP subtraction
        vsubps xmm2,xmm0,xmm1
        vmovaps [r8+16],xmm2

; Packed SPFP multiplication
        vmulps xmm2,xmm0,xmm1
        vmovaps [r8+32],xmm2

; Packed SPFP division
        vdivps xmm2,xmm0,xmm1
        vmovaps [r8+48],xmm2

; Packed SPFP absolute value (b)
        vandps xmm2,xmm1,xmmword ptr [AbsMaskF32]
        vmovaps [r8+64],xmm2

; Packed SPFP square root (a)
        vsqrtps xmm2,xmm0
        vmovaps [r8+80],xmm2

; Packed SPFP minimum
        vminps xmm2,xmm0,xmm1
        vmovaps [r8+96],xmm2

; Packed SPFP maximum
        vmaxps xmm2,xmm0,xmm1
        vmovaps [r8+112],xmm2
        ret
AvxPackedMathF32_ endp

; extern "C" void AvxPackedMathF64_(const XmmVal& a, const XmmVal& b, XmmVal c[8]);

AvxPackedMathF64_ proc

; Load packed DPFP values
        vmovapd xmm0,xmmword ptr [rcx]      ;xmm0 = a
        vmovapd xmm1,xmmword ptr [rdx]      ;xmm1 = b

; Packed DPFP addition
        vaddpd xmm2,xmm0,xmm1
        vmovapd [r8+0],xmm2

; Packed DPFP subtraction
        vsubpd xmm2,xmm0,xmm1
        vmovapd [r8+16],xmm2

; Packed DPFP multiplication
        vmulpd xmm2,xmm0,xmm1
        vmovapd [r8+32],xmm2

; Packed DPFP division
        vdivpd xmm2,xmm0,xmm1
        vmovapd [r8+48],xmm2

; Packed DPFP absolute value (b)
        vandpd xmm2,xmm1,xmmword ptr [AbsMaskF64]
        vmovapd [r8+64],xmm2

; Packed DPFP square root (a)
        vsqrtpd xmm2,xmm0
        vmovapd [r8+80],xmm2

; Packed DPFP minimum
        vminpd xmm2,xmm0,xmm1
        vmovapd [r8+96],xmm2

; Packed DPFP maximum
        vmaxpd xmm2,xmm0,xmm1
        vmovapd [r8+112],xmm2
        ret
AvxPackedMathF64_ endp
        end


Listing 6-1 begins with the declaration of a C++ structure named XmmVal that's declared in the header
file XmmVal.h. This structure contains a publicly-accessible anonymous union that facilitates packed
operand data exchange between functions written in C++ and x86 assembly language. The members of this
union correspond to the packed data types that can be used with an XMM register. The structure XmmVal also
includes several member functions that format and display the contents of an XmmVal variable (the source
code for these member functions is not shown but included with the chapter download package).

Near the top of the C++ code are the declarations for the x86-64 assembly language functions
AvxPackedMathF32_ and AvxPackedMathF64_. These functions carry out ordinary packed arithmetic operations
using the supplied XmmVal argument values. Note that for both AvxPackedMathF32_ and AvxPackedMathF64_,
arguments a and b are passed by reference instead of value in order to avoid the overhead of an XmmVal copy
operation. Using pointers to pass a and b would also work in this example since pointers and references are
the same from the perspective of the x86-64 assembly language functions.

Immediately following the assembly language function declarations is the definition for function
AvxPackedMathF32. This function contains code that demonstrates packed single-precision floating-point
arithmetic. Note that the XmmVal variables a, b, and c are all defined using the specifier alignas(16),
which instructs the C++ compiler to align each variable on a 16-byte boundary. The next set of statements
initializes the arrays a.m_F32 and b.m_F32 with test values. The C++ code then calls the assembly language
function AvxPackedMathF32_ to perform various arithmetic operations using the packed single-precision
floating-point operands. The results are then displayed using a series of stream writes to cout. The C++
code also contains a function named AvxPackedMathF64 that illustrates arithmetic operations using packed
double-precision floating-point operands. The organization of this function is similar to AvxPackedMathF32.
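The effect of alignas(16) can be verified at run time. The sketch below uses a reduced, hypothetical stand-in for XmmVal named Vec128 (the real structure has more union members and its formatting functions) together with a hypothetical helper IsAligned16.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, reduced stand-in for XmmVal: 16 bytes viewed three ways.
struct Vec128
{
    union
    {
        std::int32_t m_I32[4];
        float m_F32[4];
        double m_F64[2];
    };
};

// The union members all occupy the same 16 bytes (one XMM register's worth).
static_assert(sizeof(Vec128) == 16, "Vec128 should match the width of an XMM register");

// Returns true when p lies on a 16-byte boundary, as vmovaps requires.
inline bool IsAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Because each array element is exactly 16 bytes, aligning the start of an array such as c[8] on a 16-byte boundary aligns every element, which is why the assembly code can store results at r8+0, r8+16, and so on using aligned moves.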

The x86-64 assembly language code for example Ch06_01 begins with a .const section that defines
packed mask values for calculating floating-point absolute values. The align 16 statement is a MASM
directive that instructs the assembler to align the next variable (or instruction) to a 16-byte boundary. Using
this statement guarantees that the mask AbsMaskF32 is properly aligned. Note that unlike x86-SSE, x86-AVX
instruction operands in memory need not be properly aligned except for instructions that explicitly specify
aligned operands (e.g., vmovaps). However, proper alignment of packed operands in memory is strongly
recommended whenever possible in order to avoid the performance penalties that can occur when the
processor accesses an unaligned operand. A second align 16 directive is not necessary to ensure alignment
of AbsMaskF64 since the size of AbsMaskF32 is 16 bytes, but it would be okay to include such a statement.

The first instruction of AvxPackedMathF32_, vmovaps xmm0,xmmword ptr [rcx], loads argument a
(i.e., the four floating-point values stored in XmmVal a) into register XMM0. As mentioned in the previous
paragraph, the vmovaps (Move Aligned Packed Single-Precision Floating-Point Values) instruction requires
source operands in memory to be properly aligned. This is why the alignas(16) specifiers were used in the
C++ code. The operator xmmword ptr directs the assembler to treat the memory location pointed to by RCX
as a 128-bit operand. In this instance, use of the xmmword ptr operator is optional and employed to improve
code readability. The ensuing vmovaps xmm1,xmmword ptr [rdx] instruction loads b into register XMM1.
The vaddps xmm2,xmm0,xmm1 instruction (Add Packed Single-Precision Floating-Point Values) performs
packed single-precision floating-point addition using the contents of registers XMM0 and XMM1. It then
saves the calculated sum to register XMM2, as shown in Figure 6-1. Note that the vaddps instruction does not
modify the contents of its two source operands. The vmovaps [r8+0],xmm2 instruction that follows saves the
result of the packed arithmetic addition to c[0].


vaddps xmm2,xmm0,xmm1 ;packed SPFP add


Figure 6-1. Execution of vaddps instruction


The ensuing vsubps, vmulps, and vdivps instructions carry out packed single-precision floating- 
point subtraction, multiplication, and division. This is followed by a vandps xmm2,xmm1,xmmword ptr 
[AbsMaskF32] instruction that calculates packed absolute values using argument b. The vandps (Bitwise 
AND of Packed Single-Precision Floating-Point Values) instruction performs a bitwise AND using its two 
source operands. Note that all of the bits in each AbsMaskF32 doubleword are set to one except the most 
significant bit, which corresponds to the sign bit of a single-precision floating-point value. A sign bit value of 
zero corresponds to a positive floating-point number as discussed in Chapter 4. Performing a bitwise AND 
using this 128-bit wide mask and the packed single-precision floating-point operand b sets the sign bit of 
each element to zero and generates packed absolute values. 
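The same sign-bit masking technique can be demonstrated on a single value in C++. The sketch below assumes the IEEE 754 single-precision format that x86 uses for float; AbsViaMask is a hypothetical name, and std::memcpy is used to reinterpret the bits without violating aliasing rules.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Clears the sign bit of a single-precision value, as vandps does per element
// when the second operand is AbsMaskF32 (7fffffffh in every doubleword).
float AbsViaMask(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));   // view the float as raw bits
    bits &= 0x7FFFFFFFu;                    // keep exponent and significand only
    std::memcpy(&x, &bits, sizeof(x));      // view the bits as a float again
    return x;
}
```

Applying this function to each element of b reproduces the absps b line of the output: negative elements lose their sign, and elements that are already positive pass through unchanged.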

The remaining instructions in AvxPackedMathF32 calculate packed single-precision floating-point 
square roots (vsqrtps), minimums (vminps), and maximums (vmaxps). The organization of function 
AvxPackedMathF64 is similar to AvxPackedMathF32 .AvxPackedMathF64 carries out its calculations 
using the packed double-precision floating-point versions of the same instructions that are used in 
AvxPackedMathF32 .Here is the output for source code example Ch06 01: 
a: 36.000000 0.031250 | 2.000000 42.000000
b: -0.111111 64.000000 | -0.062500 8.666667 
addps: 35.888889 64.031250 | 1.937500 50.666668 
subps: 36.111111 -63.968750 | 2.062500 33.333332 
mulps: -4.000000 2.000000 | -0.125000 364.000000 
divps: -324.000000 0.000488 | -32.000000 4.846154 
absps b: 0.111111 64.000000 | 0.062500 8.666667 
sqrtps a: 6.000000 0.176777 | 1.414214 6.480741 
minps: -0.111111 0.031250 | -0.062500 8.666667 
maxps: 36.000000 64.000000 | 2.000000 42.000000

Results for AvxPackedMathF64

a: 2.000000000000 | 3.141592653590 
b: 2.718281828459 | -0.318309886184 
addpd: 4.718281828459 | 2.823282767406 
subpd: -0.718281828459 | 3.459902539774 
mulpd: 5.436563656918 | -1.000000000000 
divpd: 0.735758882343 | -9.869604401089 
abspd b: 2.718281828459 | 0.318309886184 
sqrtpd a: 1.414213562373 | 1.772453850906 
minpd: 2.000000000000 | -0.318309886184 
maxpd: 2.718281828459 | 3.141592653590 


Packed Floating-Point Compares 


In Chapter 5, you learned how to compare scalar single-precision and double-precision floating-point values
using the vcmps[d|s] instructions. In this section, you'll learn how to compare packed single-precision
and double-precision floating-point values using the vcmpp[d|s] instructions. Similar to their scalar
counterparts, the packed compare instructions require four operands: a destination operand, two source
operands, and an immediate compare predicate. The packed compare instructions signify their results
using quadword (vcmppd) or doubleword (vcmpps) masks of all zeros (false compare result) or all ones (true
compare result). Listing 6-2 shows the source code for example Ch06_02.


Listing 6-2. Example Ch06_02


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#define _USE_MATH_DEFINES
#include <math.h>
#include <limits>
#include "XmmVal.h"

using namespace std;

extern "C" void AvxPackedCompareF32_(const XmmVal& a, const XmmVal& b, XmmVal c[8]);
extern "C" void AvxPackedCompareF64_(const XmmVal& a, const XmmVal& b, XmmVal c[8]);

const char* c_CmpStr[8] =
{
    "EQ", "NE", "LT", "LE", "GT", "GE", "ORDERED", "UNORDERED"
};


void AvxPackedCompareF32(void)
{
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;
    alignas(16) XmmVal c[8];

    a.m_F32[0] = 2.0;     b.m_F32[0] = 1.0;
    a.m_F32[1] = 7.0;     b.m_F32[1] = 12.0;
    a.m_F32[2] = -6.0;    b.m_F32[2] = -6.0;
    a.m_F32[3] = 3.0;     b.m_F32[3] = 8.0;

    for (int i = 0; i < 2; i++)
    {
        // Second iteration: use a NaN to exercise the ordered/unordered predicates
        if (i == 1)
            a.m_F32[0] = numeric_limits<float>::quiet_NaN();

        AvxPackedCompareF32_(a, b, c);

        cout << "\nResults for AvxPackedCompareF32 (iteration = " << i << ")\n";
        cout << setw(11) << 'a' << ':' << a.ToStringF32() << '\n';
        cout << setw(11) << 'b' << ':' << b.ToStringF32() << '\n';
        cout << '\n';

        for (int j = 0; j < 8; j++)
            cout << setw(11) << c_CmpStr[j] << ':' << c[j].ToStringX32() << '\n';
    }
}


void AvxPackedCompareF64(void)
{
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;
    alignas(16) XmmVal c[8];

    a.m_F64[0] = 2.0;     b.m_F64[0] = M_E;
    a.m_F64[1] = M_PI;    b.m_F64[1] = -M_1_PI;

    for (int i = 0; i < 2; i++)
    {
        // Second iteration: use a NaN and a pair of equal values
        if (i == 1)
        {
            a.m_F64[0] = numeric_limits<double>::quiet_NaN();
            b.m_F64[1] = a.m_F64[1];
        }

        AvxPackedCompareF64_(a, b, c);

        cout << "\nResults for AvxPackedCompareF64 (iteration = " << i << ")\n";
        cout << setw(11) << 'a' << ':' << a.ToStringF64() << '\n';
        cout << setw(11) << 'b' << ':' << b.ToStringF64() << '\n';
        cout << '\n';

        for (int j = 0; j < 8; j++)
            cout << setw(11) << c_CmpStr[j] << ':' << c[j].ToStringX64() << '\n';
    }
}

int main()
{
    AvxPackedCompareF32();
    AvxPackedCompareF64();
    return 0;
}
;-------------------------------------------------
;               Ch06_02.asm
;-------------------------------------------------

        include <cmpequ.asmh>

; extern "C" void AvxPackedCompareF32_(const XmmVal& a, const XmmVal& b, XmmVal c[8]);

        .code
AvxPackedCompareF32_ proc
        vmovaps xmm0,[rcx]                  ;xmm0 = a
        vmovaps xmm1,[rdx]                  ;xmm1 = b

; Perform packed EQUAL compare
        vcmpps xmm2,xmm0,xmm1,CMP_EQ
        vmovdqa xmmword ptr [r8],xmm2

; Perform packed NOT EQUAL compare
        vcmpps xmm2,xmm0,xmm1,CMP_NEQ
        vmovdqa xmmword ptr [r8+16],xmm2

; Perform packed LESS THAN compare
        vcmpps xmm2,xmm0,xmm1,CMP_LT
        vmovdqa xmmword ptr [r8+32],xmm2

; Perform packed LESS THAN OR EQUAL compare
        vcmpps xmm2,xmm0,xmm1,CMP_LE
        vmovdqa xmmword ptr [r8+48],xmm2

; Perform packed GREATER THAN compare
        vcmpps xmm2,xmm0,xmm1,CMP_GT
        vmovdqa xmmword ptr [r8+64],xmm2

; Perform packed GREATER THAN OR EQUAL compare
        vcmpps xmm2,xmm0,xmm1,CMP_GE
        vmovdqa xmmword ptr [r8+80],xmm2

; Perform packed ORDERED compare
        vcmpps xmm2,xmm0,xmm1,CMP_ORD
        vmovdqa xmmword ptr [r8+96],xmm2

; Perform packed UNORDERED compare
        vcmpps xmm2,xmm0,xmm1,CMP_UNORD
        vmovdqa xmmword ptr [r8+112],xmm2
        ret
AvxPackedCompareF32_ endp


; extern "C" void AvxPackedCompareF64_(const XmmVal& a, const XmmVal& b, XmmVal c[8]);

AvxPackedCompareF64_ proc
        vmovapd xmm0,[rcx]                  ;xmm0 = a
        vmovapd xmm1,[rdx]                  ;xmm1 = b

; Perform packed EQUAL compare
        vcmppd xmm2,xmm0,xmm1,CMP_EQ
        vmovdqa xmmword ptr [r8],xmm2

; Perform packed NOT EQUAL compare
        vcmppd xmm2,xmm0,xmm1,CMP_NEQ
        vmovdqa xmmword ptr [r8+16],xmm2

; Perform packed LESS THAN compare
        vcmppd xmm2,xmm0,xmm1,CMP_LT
        vmovdqa xmmword ptr [r8+32],xmm2

; Perform packed LESS THAN OR EQUAL compare
        vcmppd xmm2,xmm0,xmm1,CMP_LE
        vmovdqa xmmword ptr [r8+48],xmm2

; Perform packed GREATER THAN compare
        vcmppd xmm2,xmm0,xmm1,CMP_GT
        vmovdqa xmmword ptr [r8+64],xmm2

; Perform packed GREATER THAN OR EQUAL compare
        vcmppd xmm2,xmm0,xmm1,CMP_GE
        vmovdqa xmmword ptr [r8+80],xmm2

; Perform packed ORDERED compare
        vcmppd xmm2,xmm0,xmm1,CMP_ORD
        vmovdqa xmmword ptr [r8+96],xmm2

; Perform packed UNORDERED compare
        vcmppd xmm2,xmm0,xmm1,CMP_UNORD
        vmovdqa xmmword ptr [r8+112],xmm2
        ret
AvxPackedCompareF64_ endp
        end


Figure 6-2 illustrates execution of the vcmpps xmm2,xmm0,xmm1,0 and vcmppd xmm2,xmm0,xmm1,1
instructions. In these examples, the compare predicate operands 0 and 1 test for equality and less than,
respectively.


vcmpps xmm2,xmm0,xmm1,0 ;packed SPFP compare for EQ


vcmppd xmm2,xmm0,xmm1,1 ;packed DPFP compare for LT


Figure 6-2. Execution of the vcmpps and vcmppd instructions 


The C++ function AvxPackedCompareF32 begins by initializing a couple of XmmVal test variables. Similar
to the example that you saw in the previous section, the alignas(16) specifier is used with each XmmVal
variable to force proper alignment to a 16-byte boundary. The remaining code in this function invokes the
assembly language function AvxPackedCompareF32_ and displays the results. Note that on the second iteration
of the for loop, the constant numeric_limits<float>::quiet_NaN() is substituted for one of the values in
XmmVal a to exemplify operation of the ordered and unordered compare predicates. An ordered compare is
true when both operands are valid values. An unordered compare is true when one or both of the operands
are NaNs or erroneously encoded. Substituting numeric_limits<float>::quiet_NaN() for one of the values
in XmmVal a generates a true result for an unordered compare. The C++ code also includes the function
AvxPackedCompareF64, which is the double-precision counterpart of AvxPackedCompareF32.
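
The mask results and the ordered/unordered predicates described above can be modeled for a single doubleword element in portable C++. This is a minimal sketch under stated assumptions, not part of the example; the names MaskModel, CmpOrdModel, CmpUnordModel, and CmpEqModel are hypothetical, and std::isnan stands in for the processor's NaN detection.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical scalar model of one doubleword lane of a packed compare.
// A true compare yields a mask of all ones; a false compare yields all zeros.
std::uint32_t MaskModel(bool result)
{
    return result ? 0xFFFFFFFFu : 0x00000000u;
}

// Ordered: neither operand is a NaN
std::uint32_t CmpOrdModel(float a, float b)
{
    return MaskModel(!std::isnan(a) && !std::isnan(b));
}

// Unordered: at least one operand is a NaN
std::uint32_t CmpUnordModel(float a, float b)
{
    return MaskModel(std::isnan(a) || std::isnan(b));
}

// Equal: IEEE 754 equality is false whenever either operand is a NaN
std::uint32_t CmpEqModel(float a, float b)
{
    return MaskModel(a == b);
}
```

With a = NaN and b = 1.0, CmpOrdModel returns 0x00000000 and CmpUnordModel returns 0xFFFFFFFF, which matches the low-order element of the ORDERED and UNORDERED rows on the second iteration of the example's output.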

The x86-64 assembly language code begins with an include <cmpequ.asmh> statement. This file, which
also was used in example Ch05_05, contains compare predicate equates that will be used in this source code
example with the vcmpp[d|s] instructions. The assembly language function AvxPackedCompareF32_ starts
with two vmovaps instructions that load arguments a and b into registers XMM0 and XMM1, respectively.
The ensuing vcmpps xmm2,xmm0,xmm1,CMP_EQ instruction compares packed operands a and b for equality
and saves the packed result (four doubleword mask values) to register XMM2. The contents of register
XMM2 are then saved to the result array c using a vmovdqa xmmword ptr [r8],xmm2 instruction. The
remaining code in AvxPackedCompareF32_ performs additional compare operations using recognizable
compare predicates. The assembly language function AvxPackedCompareF64_ demonstrates how to use the
vcmppd instruction to carry out packed double-precision floating-point compares. Here is the output for
example Ch06_02:


Results for AvxPackedCompareF32 (iteration = 0) 

a: 2.000000 7.000000 | -6.000000 3.000000 

b: 1.000000 12.000000 | -6.000000 8.000000 

EQ: 00000000 00000000 | FFFFFFFF 00000000 

NE: FFFFFFFF FFFFFFFF | 00000000 FFFFFFFF 

LT: 00000000 FFFFFFFF | 00000000 FFFFFFFF 

LE: 00000000 FFFFFFFF | FFFFFFFF FFFFFFFF 

GT: FFFFFFFF 00000000 | 00000000 00000000 

GE: FFFFFFFF 00000000 | FFFFFFFF 00000000 

ORDERED: FFFFFFFF FFFFFFFF | FFFFFFFF FFFFFFFF 

UNORDERED: 00000000 00000000 | 00000000 00000000 

Results for AvxPackedCompareF32 (iteration = 1)

a: nan 7.000000 | -6.000000 3.000000 

b: 1.000000 12.000000 | -6.000000 8.000000 

EQ: 00000000 00000000 | FFFFFFFF 00000000 

NE: FFFFFFFF FFFFFFFF | 00000000 FFFFFFFF 

LT: 00000000 FFFFFFFF | 00000000 FFFFFFFF 

LE: 00000000 FFFFFFFF | FFFFFFFF FFFFFFFF

GT: 00000000 00000000 | 00000000 00000000 

GE: 00000000 00000000 | FFFFFFFF 00000000 

ORDERED: 00000000 FFFFFFFF | FFFFFFFF FFFFFFFF 

UNORDERED: FFFFFFFF 00000000 | 00000000 00000000 
Results for AvxPackedCompareF64 (iteration = 0) 

a: 2.000000000000 | 3.141592653590 

b: 2.718281828459 | -0.318309886184 

EQ: 0000000000000000 | 0000000000000000 

NE: FFFFFFFFFFFFFFFF | FFFFFFFFFFFFFFFF 

LT: FFFFFFFFFFFFFFFF | 0000000000000000

LE: FFFFFFFFFFFFFFFF | 0000000000000000

GT: 0000000000000000 | FFFFFFFFFFFFFFFF 

GE: 0000000000000000 | FFFFFFFFFFFFFFFF 

ORDERED: FFFFFFFFFFFFFFFF | FFFFFFFFFFFFFFFF 

UNORDERED: 0000000000000000 | 0000000000000000 
Results for AvxPackedCompareF64 (iteration = 1) 

a: nan | 3.141592653590 

b: 2.718281828459 | 3.141592653590 

EQ: 0000000000000000 | FFFFFFFFFFFFFFFF 

NE: FFFFFFFFFFFFFFFF | 0000000000000000 

LT: 0000000000000000 | 0000000000000000 

LE: 0000000000000000 | FFFFFFFFFFFFFFFF 
GT: 0000000000000000 | 0000000000000000

GE: 0000000000000000 | FFFFFFFFFFFFFFFF 
ORDERED: 0000000000000000 | FFFFFFFFFFFFFFFF 
UNORDERED: FFFFFFFFFFFFFFFF | 0000000000000000 


Packed Floating-Point Conversions 


The next source code example is named Ch06_03. This example shows how to convert packed signed
doubleword integers to packed floating-point values and vice versa. It also illustrates conversions between
packed single-precision and packed double-precision floating-point values. Listing 6-3 shows the source code
for example Ch06_03.


Listing 6-3. Example Ch06_03


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#define _USE_MATH_DEFINES
#include <math.h>
#include "XmmVal.h"

using namespace std;

// The order of values in the following enum must match the jump table
// that's defined in Ch06_03.asm.
enum CvtOp : unsigned int
{
    I32_F32, F32_I32, I32_F64, F64_I32, F32_F64, F64_F32,
};

extern "C" bool AvxPackedConvertFP_(const XmmVal& a, XmmVal& b, CvtOp cvt_op);



void AvxPackedConvertF32(void)
{
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;

    a.m_I32[0] = 10;
    a.m_I32[1] = -500;
    a.m_I32[2] = 600;
    a.m_I32[3] = -1024;
    AvxPackedConvertFP_(a, b, CvtOp::I32_F32);
    cout << "\nResults for CvtOp::I32_F32\n";
    cout << "a: " << a.ToStringI32() << '\n';
    cout << "b: " << b.ToStringF32() << '\n';

    a.m_F32[0] = 1.0f / 3.0f;
    a.m_F32[1] = 2.0f / 3.0f;
    a.m_F32[2] = -a.m_F32[0] * 2.0f;
    a.m_F32[3] = -a.m_F32[1] * 2.0f;
    AvxPackedConvertFP_(a, b, CvtOp::F32_I32);
    cout << "\nResults for CvtOp::F32_I32\n";
    cout << "a: " << a.ToStringF32() << '\n';
    cout << "b: " << b.ToStringI32() << '\n';

    // F32_F64 converts the two low-order SPFP values of 'a'
    a.m_F32[0] = 1.0f / 7.0f;
    a.m_F32[1] = 2.0f / 9.0f;
    a.m_F32[2] = 0;
    a.m_F32[3] = 0;
    AvxPackedConvertFP_(a, b, CvtOp::F32_F64);
    cout << "\nResults for CvtOp::F32_F64\n";
    cout << "a: " << a.ToStringF32() << '\n';
    cout << "b: " << b.ToStringF64() << '\n';
}


void AvxPackedConvertF64(void)
{
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;

    // I32_F64 converts the two low-order doubleword integers of 'a'
    a.m_I32[0] = 10;
    a.m_I32[1] = -20;
    a.m_I32[2] = 0;
    a.m_I32[3] = 0;
    AvxPackedConvertFP_(a, b, CvtOp::I32_F64);
    cout << "\nResults for CvtOp::I32_F64\n";
    cout << "a: " << a.ToStringI32() << '\n';
    cout << "b: " << b.ToStringF64() << '\n';

    // F64_I32 sets the two high-order doublewords of 'b' to zero
    a.m_F64[0] = M_PI;
    a.m_F64[1] = M_E;
    AvxPackedConvertFP_(a, b, CvtOp::F64_I32);
    cout << "\nResults for CvtOp::F64_I32\n";
    cout << "a: " << a.ToStringF64() << '\n';
    cout << "b: " << b.ToStringI32() << '\n';

    // F64_F32 sets the two high-order SPFP values of 'b' to zero
    a.m_F64[0] = M_SQRT2;
    a.m_F64[1] = M_SQRT1_2;
    AvxPackedConvertFP_(a, b, CvtOp::F64_F32);
    cout << "\nResults for CvtOp::F64_F32\n";
    cout << "a: " << a.ToStringF64() << '\n';
    cout << "b: " << b.ToStringF32() << '\n';
}

int main()
{
    AvxPackedConvertF32();
    AvxPackedConvertF64();
    return 0;
}


; extern "C" bool AvxPackedConvertFP_(const XmmVal& a, XmmVal& b, CvtOp cvt_op);
;
; Note:     This function requires linker option /LARGEADDRESSAWARE:NO
;           to be explicitly set.

        .code
AvxPackedConvertFP_ proc

; Make sure cvt_op is valid
        mov r9d,r8d                         ;r9 = cvt_op (zero extended)
        cmp r9,CvtOpTableCount              ;is cvt_op valid?
        jae InvalidCvtOp                    ;jmp if cvt_op is invalid
        mov eax,1                           ;set valid cvt_op return code
        jmp [CvtOpTable+r9*8]               ;jump to specified conversion

; Convert packed signed doubleword integers to packed SPFP values
I32_F32:
        vmovdqa xmm0,xmmword ptr [rcx]
        vcvtdq2ps xmm1,xmm0
        vmovaps xmmword ptr [rdx],xmm1
        ret

; Convert packed SPFP values to packed signed doubleword integers
F32_I32:
        vmovaps xmm0,xmmword ptr [rcx]
        vcvtps2dq xmm1,xmm0
        vmovdqa xmmword ptr [rdx],xmm1
        ret

; Convert packed signed doubleword integers to packed DPFP values
I32_F64:
        vmovdqa xmm0,xmmword ptr [rcx]
        vcvtdq2pd xmm1,xmm0
        vmovapd xmmword ptr [rdx],xmm1
        ret

; Convert packed DPFP values to packed signed doubleword integers
F64_I32:
        vmovapd xmm0,xmmword ptr [rcx]
        vcvtpd2dq xmm1,xmm0
        vmovdqa xmmword ptr [rdx],xmm1
        ret

; Convert packed SPFP to packed DPFP
F32_F64:
        vmovaps xmm0,xmmword ptr [rcx]
        vcvtps2pd xmm1,xmm0
        vmovapd xmmword ptr [rdx],xmm1
        ret

; Convert packed DPFP to packed SPFP
F64_F32:
        vmovapd xmm0,xmmword ptr [rcx]
        vcvtpd2ps xmm1,xmm0
        vmovaps xmmword ptr [rdx],xmm1
        ret

InvalidCvtOp:
        xor eax,eax                         ;set invalid cvt_op return code
        ret

; The order of values in the following table must match the enum CvtOp
; that's defined in Ch06_03.cpp.

        align 8
CvtOpTable qword I32_F32, F32_I32
           qword I32_F64, F64_I32
           qword F32_F64, F64_F32
CvtOpTableCount equ ($ - CvtOpTable) / size qword

AvxPackedConvertFP_ endp
        end


The C++ code begins with an enum named CvtOp that defines the conversion operations supported by
the assembly language function AvxPackedConvertFP_. The actual enumerator values in CvtOp are critical
since the assembly language code uses them as indices into a jump table. The function that follows CvtOp,
AvxPackedConvertF32, exercises some test cases using packed single-precision floating-point operands.
Similarly, the function AvxPackedConvertF64 contains test cases for packed double-precision floating-point
operands. As in the previous examples of this chapter, all XmmVal variable declarations in these functions use
the alignas(16) specifier to ensure proper alignment.

Toward the bottom of the assembly language code in Listing 6-3 is the previously mentioned jump
table. CvtOpTable contains a list of labels that are defined in the function AvxPackedConvertFP_. The
target of each label is a short code block that performs a specific conversion. The equate CvtOpTableCount
defines the number of items in the jump table and is used to validate the argument value cvt_op. The
align 8 directive instructs the assembler to align CvtOpTable on a quadword boundary in order to avoid
unaligned memory accesses when referencing elements in the table. Note that CvtOpTable is defined inside
the assembly language function AvxPackedConvertFP_ (i.e., between the proc and endp directives), which
means that storage for the table is allocated in a .code section. Clearly, the jump table does not contain any
intentional executable instructions, and this is why the table is positioned after the ret instruction. This also
means that the jump table is read-only; the processor will generate an exception on any write attempt to the
table.

The assembly language function AvxPackedConvertFP_ begins its execution by validating the argument
value cvt_op. The ensuing jmp [CvtOpTable+r9*8] instruction transfers control to a code block that
performs the actual packed data conversion. During execution of this instruction, the processor loads
register RIP with the contents of memory that's specified by [CvtOpTable+r9*8]. In the current example,
register R9 contains cvt_op and this value is used as an index into CvtOpTable.
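
The validate-then-dispatch pattern described above has a close analogue in C++: an array of function pointers indexed by an enumerator. The following is a minimal sketch, not the book's code; DemoOp, the three operation functions, g_DemoOpTable, and DispatchDemo are all hypothetical names chosen for illustration.

```cpp
#include <cstddef>

// Hypothetical operations; the enumerator order must match the table order.
enum DemoOp : unsigned int { Negate, Double, Square };

static int NegateOp(int x) { return -x; }
static int DoubleOp(int x) { return x * 2; }
static int SquareOp(int x) { return x * x; }

// Table of code addresses, analogous to CvtOpTable
static int (* const g_DemoOpTable[])(int) = { NegateOp, DoubleOp, SquareOp };

// Number of table entries, analogous to CvtOpTableCount
static const std::size_t g_DemoOpTableCount =
    sizeof(g_DemoOpTable) / sizeof(g_DemoOpTable[0]);

// Returns false if op is out of range; otherwise dispatches through the table.
bool DispatchDemo(unsigned int op, int x, int* result)
{
    if (op >= g_DemoOpTableCount)       // unsigned range check, like cmp/jae
        return false;

    *result = g_DemoOpTable[op](x);     // indexed indirect call through the table
    return true;
}
```

As in the assembly language version, the index is validated with an unsigned comparison before it is used, so an out-of-range value can never select an address beyond the end of the table.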

The conversion code blocks in AvxPackedConvertFP_ use the aligned move instructions vmovaps,
vmovapd, and vmovdqa to transfer packed operands to and from memory. Specific AVX conversion
instructions carry out the requested operations. For example, the vcvtps2dq instruction converts packed
single-precision floating-point values to signed doubleword integers, and the vcvtdq2ps instruction
performs the reverse conversion. When used with 128-bit wide operands, these instructions convert four values
simultaneously. The counterpart double-precision instructions, vcvtpd2dq and vcvtdq2pd, are slightly
different in that only two values are converted due to the element size differences (32 and 64 bits). The
vcvtps2pd and vcvtpd2ps instructions perform their conversions in a similar manner. Note that the
vcvtpd2dq and vcvtpd2ps instructions set the high-order 64 bits of the destination operand to zero. All of
the AVX packed conversion instructions use the rounding mode that's specified by the rounding control field
MXCSR.RC, as described in Chapter 4. The default rounding mode for Visual C++ is round to nearest. Here is
the output for example Ch06_03:


Results for CvtOp::I32_F32
a: 10 -500 | 600 -1024
b: 10.000000 -500.000000 | 600.000000 -1024.000000

Results for CvtOp::F32_I32
a: 0.333333 0.666667 | -0.666667 -1.333333
b: 0 1 | -1 -1

Results for CvtOp::F32_F64
a: 0.142857 0.222222 | 0.000000 0.000000
b: 0.142857149243 | 0.222222223878

Results for CvtOp::I32_F64
a: 10 -20 | 0 0
b: 10.000000000000 | -20.000000000000

Results for CvtOp::F64_I32
a: 3.141592653590 | 2.718281828459
b: 3 3 | 0 0

Results for CvtOp::F64_F32
a: 1.414213562373 | 0.707106781187
b: 1.414214 0.707107 | 0.000000 0.000000
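
The round-to-nearest behavior visible in the F32_I32 and F64_I32 results can be reproduced for a single element with the C++ standard library. This is a minimal sketch, not the book's code; the function name CvtF32ToI32Model is hypothetical. It assumes the C++ floating-point environment is in its default state, in which std::lrint rounds to nearest with ties going to the even integer, and it does not model what the conversion instructions return for out-of-range values or NaNs.

```cpp
#include <cmath>

// Hypothetical scalar model of one float-to-integer conversion lane.
// std::lrint rounds using the current floating-point rounding mode,
// which defaults to round to nearest (ties to even).
long CvtF32ToI32Model(float x)
{
    return std::lrint(x);
}
```

For the inputs used in the example, the model returns 0 for 0.333333, 1 for 0.666667, and -1 for both -0.666667 and -1.333333. A tie such as 2.5 rounds to 2 rather than 3 because ties go to the even integer.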


Packed Floating-Point Arrays 


The computational resources of AVX are often employed to accelerate calculations using arrays of single- 
precision or double-precision floating-point values. In this section, you learn how to use packed arithmetic 
to process multiple elements of a floating-point array simultaneously. You also see examples of additional 
AVX instructions and learn how to perform runtime alignment checks of operands in memory. 


Packed Floating-Point Square Roots


Listing 6-4 shows the code for example Ch06_04, which illustrates how to perform a simple packed
arithmetic calculation using a single-precision floating-point array. It also explains how to perform a
runtime check of an array's address to ensure that it's properly aligned in memory.


Listing 6-4. Example Ch06_04


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>
#include <cmath>
#include <cstdint>

using namespace std;

extern "C" bool AvxCalcSqrts_(float* y, const float* x, size_t n);

void Init(float* x, size_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {1, 2000};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
        x[i] = (float)ui_dist(rng);
}


bool AvxCalcSqrtsCpp(float* y, const float* x, size_t n)
{
    const size_t alignment = 16;

    if (n == 0)
        return false;

    // Both arrays must be aligned to a 16-byte boundary
    if (((uintptr_t)x % alignment) != 0)
        return false;

    if (((uintptr_t)y % alignment) != 0)
        return false;

    for (size_t i = 0; i < n; i++)
        y[i] = sqrt(x[i]);

    return true;
}

int main()
{
    const size_t n = 19;
    alignas(16) float x[n];
    alignas(16) float y1[n];
    alignas(16) float y2[n];

    Init(x, n, 53);

    bool rc1 = AvxCalcSqrtsCpp(y1, x, n);
    bool rc2 = AvxCalcSqrts_(y2, x, n);

    cout << fixed << setprecision(4);
    cout << "\nResults for AvxCalcSqrts\n";

    if (!rc1 || !rc2)
        cout << "Invalid return code\n";
    else
    {
        const char* sp = "   ";

        for (size_t i = 0; i < n; i++)
        {
            cout << "i: " << setw(2) << i << sp;
            cout << "x: " << setw(9) << x[i] << sp;
            cout << "y1: " << setw(9) << y1[i] << sp;
            cout << "y2: " << setw(9) << y2[i] << '\n';
        }
    }
}
;-------------------------------------------------
;               Ch06_04.asm
;-------------------------------------------------

; extern "C" bool AvxCalcSqrts_(float* y, const float* x, size_t n);

        .code
AvxCalcSqrts_ proc
        xor eax,eax                         ;set error return code (also array offset)

        test r8,r8
        jz Done                             ;jump if n is zero

        test rcx,0fh
        jnz Done                            ;jump if 'y' is not aligned

        test rdx,0fh
        jnz Done                            ;jump if 'x' is not aligned

; Calculate packed square roots
        cmp r8,4
        jb FinalVals                        ;jump if n < 4

@@:     vsqrtps xmm0,xmmword ptr [rdx+rax]  ;calculate 4 square roots x[i+3:i]
        vmovaps xmmword ptr [rcx+rax],xmm0  ;save results to y[i+3:i]

        add rax,16                          ;update offset to next set of values
        sub r8,4
        cmp r8,4                            ;are there 4 or more elements remaining?
        jae @B                              ;jump if yes

; Calculate square roots of final 1 - 3 values, note switch to scalar instructions
FinalVals:
        test r8,r8                          ;more elements to process?
        jz SetRC                            ;jump if no more elements

        vsqrtss xmm0,xmm0,real4 ptr [rdx+rax]   ;calculate sqrt(x[i])
        vmovss real4 ptr [rcx+rax],xmm0         ;save result to y[i]
        add rax,4
        dec r8
        jz SetRC

        vsqrtss xmm0,xmm0,real4 ptr [rdx+rax]
        vmovss real4 ptr [rcx+rax],xmm0
        add rax,4
        dec r8
        jz SetRC

        vsqrtss xmm0,xmm0,real4 ptr [rdx+rax]
        vmovss real4 ptr [rcx+rax],xmm0

SetRC:  mov eax,1                           ;set success return code

Done:   ret
AvxCalcSqrts_ endp
        end


The C++ code in Listing 6-4 includes a function named AvxCalcSqrtsCpp, which calculates y[i] =
sqrt(x[i]). Before performing any of the required calculations, array size argument n is tested to make sure
that it's not equal to zero. The pointers y and x are also tested to ensure that the respective arrays are properly
aligned to a 16-byte boundary. An array is aligned to a 16-byte boundary if its address is evenly divisible by 16.
The function returns an error code if any of these checks fail.

Assembly language function AvxCalcSqrts_ mimics the functionality of its C++ counterpart. The
test r8,r8 and jz Done instructions ensure that the number of array elements n is greater than zero. The
ensuing test rcx,0fh instruction checks array y for alignment to a 16-byte boundary. Recall that the test
instruction performs a bitwise AND of its two operands and sets the status flags in RFLAGS according to
the result (the actual result of the bitwise AND is discarded). If the test rcx,0fh instruction yields a non-zero
value, array y is not aligned on a 16-byte boundary, and the function exits without performing any
calculations. A similar test is used to ensure that array x is properly aligned.
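
The two alignment checks are equivalent: an address is evenly divisible by 16 exactly when its four low-order bits are zero, which is what the bitwise AND with 0fh tests. The following is a minimal C++ sketch of the bit-mask form; the function name IsAligned16 is hypothetical and is not part of the example's source code.

```cpp
#include <cstdint>

// Returns true if p is aligned to a 16-byte boundary.
// Testing the low four address bits is equivalent to (address % 16) == 0.
bool IsAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 0x0F) == 0;
}
```

Given an array declared with alignas(16) float buf[8], IsAligned16(buf) is true, IsAligned16(buf + 1) is false because that address is offset by four bytes, and IsAligned16(buf + 4) is true again.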

The processing loop uses a vsqrtps instruction to calculate the required square roots. When used
with 128-bit wide operands, this instruction calculates four single-precision floating-point square roots
simultaneously. Using 128-bit wide operands means that the processing loop cannot execute a vsqrtps
instruction if there are fewer than four element values remaining to be processed. Before performing any
calculations using vsqrtps, R8 is checked to make sure that it's greater than or equal to four. If R8 is less than
four, the processing loop is skipped. The processing loop employs a vsqrtps xmm0,xmmword ptr [rdx+rax]
instruction to calculate square roots of the four single-precision floating-point values located at the memory
address specified by the source operand. It then stores the calculated square roots in register XMM0. A
vmovaps xmmword ptr [rcx+rax],xmm0 instruction saves the four calculated square roots to y. Execution of
the vsqrtps and vmovaps instructions continues until the number of elements remaining to be processed is
less than four.

Following execution of the processing loop, the block of code starting at label FinalVals calculates the
square roots for the final few values of array x. Note that the scalar AVX instructions vsqrtss and vmovss
perform these final (one, two, or three) calculations. Here is the output for source code example
Ch06_04.


Results for AvxCalcSqrts

i:  0   x: 1354.0000   y1:   36.7967   y2:   36.7967
i:  1   x:  494.0000   y1:   22.2261   y2:   22.2261
i:  2   x: 1638.0000   y1:   40.4722   y2:   40.4722
i:  3   x:  278.0000   y1:   16.6733   y2:   16.6733
i:  4   x: 1004.0000   y1:   31.6860   y2:   31.6860
i:  5   x:  318.0000   y1:   17.8326   y2:   17.8326
i:  6   x: 1735.0000   y1:   41.6533   y2:   41.6533
i:  7   x: 1221.0000   y1:   34.9428   y2:   34.9428
i:  8   x:  544.0000   y1:   23.3238   y2:   23.3238
i:  9   x: 1568.0000   y1:   39.5980   y2:   39.5980
i: 10   x: 1633.0000   y1:   40.4104   y2:   40.4104
i: 11   x: 1577.0000   y1:   39.7115   y2:   39.7115
i: 12   x: 1659.0000   y1:   40.7308   y2:   40.7308
i: 13   x: 1565.0000   y1:   39.5601   y2:   39.5601
i: 14   x:   74.0000   y1:    8.6023   y2:    8.6023
i: 15   x: 1195.0000   y1:   34.5688   y2:   34.5688
i: 16   x:  406.0000   y1:   20.1494   y2:   20.1494
i: 17   x:  483.0000   y1:   21.9773   y2:   21.9773
i: 18   x: 1307.0000   y1:   36.1525   y2:   36.1525
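
The two-stage structure of the assembly language function (full groups of four elements, then a scalar pass over the leftovers) can be modeled in portable C++. This is a minimal sketch rather than the book's code; the function name CalcSqrtsBlockedModel is hypothetical, the inner four-iteration loop merely stands in for one packed instruction, and the return value (the number of elements handled by the four-wide stage) is added only to make the split observable.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical model of the loop structure: groups of four elements first,
// then a scalar pass over the remaining 0 - 3 elements.
// Returns the number of elements handled by the four-wide stage.
std::size_t CalcSqrtsBlockedModel(float* y, const float* x, std::size_t n)
{
    std::size_t i = 0;

    // Four-wide stage: each outer iteration stands in for one packed
    // square root followed by one packed store
    for (; n - i >= 4; i += 4)
    {
        for (std::size_t j = 0; j < 4; j++)
            y[i + j] = std::sqrt(x[i + j]);
    }

    const std::size_t num_packed = i;

    // Scalar stage: one element at a time for the final 0 - 3 values
    for (; i < n; i++)
        y[i] = std::sqrt(x[i]);

    return num_packed;
}
```

For the 19-element array used in this example, the four-wide stage handles elements 0 through 15 and the scalar stage handles the final three.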


The source code in Listing 6-4 can be easily adapted to process double-precision instead of single-precision
floating-point values. In the C++ code, changing all float variables to double is the only required
modification. In the assembly language code, the vsqrtpd and vmovapd instructions must be used instead
of vsqrtps and vmovaps. The counting variables in AvxCalcSqrts_ must also be changed to process two
double-precision instead of four single-precision floating-point values per iteration.


Packed Floating-Point Array Min-Max


Listing 6-5 shows the source code for example Ch06_05. This example demonstrates how to compute the
minimum and maximum value of a single-precision floating-point array using packed AVX instructions. 


Listing 6-5. Example Ch06_05


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>
#include <limits>
#include "AlignedMem.h"

using namespace std;

extern "C" float g_MinValInit = numeric_limits<float>::max();
extern "C" float g_MaxValInit = -numeric_limits<float>::max();

extern "C" bool CalcArrayMinMaxF32_(float* min_val, float* max_val, const float* x, size_t n);

void Init(float* x, size_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {1, 10000};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
        x[i] = (float)ui_dist(rng);
}


bool CalcArrayMinMaxF32Cpp(float* min_val, float* max_val, const float* x, size_t n)
{
    // Make sure x is properly aligned
    if (!AlignedMem::IsAligned(x, 16))
        return false;

    // Find array minimum and maximum values
    float min_val_temp = g_MinValInit;
    float max_val_temp = g_MaxValInit;

    if (n > 0)
    {
        for (size_t i = 0; i < n; i++)
        {
            if (x[i] < min_val_temp)
                min_val_temp = x[i];

            if (x[i] > max_val_temp)
                max_val_temp = x[i];
        }
    }

    *min_val = min_val_temp;
    *max_val = max_val_temp;
    return true;
}


int main()
{
    const size_t n = 31;
    alignas(16) float x[n];

    Init(x, n, 73);

    float min_val1, max_val1;
    float min_val2, max_val2;

    CalcArrayMinMaxF32Cpp(&min_val1, &max_val1, x, n);
    CalcArrayMinMaxF32_(&min_val2, &max_val2, x, n);

    cout << fixed << setprecision(1);

    cout << "-------------- Array X --------------\n";

    for (size_t i = 0; i < n; i++)
    {
        cout << "x[" << setw(2) << i << "]: " << setw(9) << x[i];

        if (i & 1)
            cout << '\n';
        else
            cout << "    ";
    }

    cout << '\n';

    cout << "\nResults for CalcArrayMinMaxF32Cpp\n";
    cout << "  min_val = " << setw(9) << min_val1 << ", ";
    cout << "  max_val = " << setw(9) << max_val1 << '\n';

    cout << "\nResults for CalcArrayMinMaxF32_\n";
    cout << "  min_val = " << setw(9) << min_val2 << ", ";
    cout << "  max_val = " << setw(9) << max_val2 << '\n';
    return 0;
}

; extern "C" bool CalcArrayMinMaxF32_(float* min_val, float* max_val, const float* x, size_t n)

        extern g_MinValInit:real4
        extern g_MaxValInit:real4

        .code
CalcArrayMinMaxF32_ proc

; Validate arguments
        xor eax,eax                         ;set error return code
        test r8,0fh                         ;is x aligned to 16-byte boundary?
        jnz Done                            ;jump if no

        vbroadcastss xmm4,real4 ptr [g_MinValInit]  ;xmm4 = min values
        vbroadcastss xmm5,real4 ptr [g_MaxValInit]  ;xmm5 = max values

        cmp r9,4
        jb FinalVals                        ;jump if n < 4

; Main processing loop
@@:     vmovaps xmm0,xmmword ptr [r8]       ;load next set of array values
        vminps xmm4,xmm4,xmm0               ;update packed min values
        vmaxps xmm5,xmm5,xmm0               ;update packed max values

        add r8,16
        sub r9,4
        cmp r9,4
        jae @B

; Process the final 1 - 3 values of the input array
FinalVals:
        test r9,r9
        jz SaveResults

        vminss xmm4,xmm4,real4 ptr [r8]     ;update packed min values
        vmaxss xmm5,xmm5,real4 ptr [r8]     ;update packed max values
        dec r9
        jz SaveResults

        vminss xmm4,xmm4,real4 ptr [r8+4]
        vmaxss xmm5,xmm5,real4 ptr [r8+4]
        dec r9
        jz SaveResults

        vminss xmm4,xmm4,real4 ptr [r8+8]
        vmaxss xmm5,xmm5,real4 ptr [r8+8]

; Calculate and save final min & max values
SaveResults:
        vshufps xmm0,xmm4,xmm4,00001110b    ;xmm0[63:0] = xmm4[127:64]
        vminps xmm1,xmm0,xmm4               ;xmm1[63:0] contains final 2 values
        vshufps xmm2,xmm1,xmm1,00000001b    ;xmm2[31:0] = xmm1[63:32]
        vminps xmm3,xmm2,xmm1               ;xmm3[31:0] contains final value
        vmovss real4 ptr [rcx],xmm3         ;save array min value

        vshufps xmm0,xmm5,xmm5,00001110b
        vmaxps xmm1,xmm0,xmm5
        vshufps xmm2,xmm1,xmm1,00000001b
        vmaxps xmm3,xmm2,xmm1
        vmovss real4 ptr [rdx],xmm3         ;save array max value

        mov eax,1                           ;set success return code

Done:   ret
CalcArrayMinMaxF32_ endp
        end


The structure of the C++ source code that's shown in Listing 6-5 is similar to the previous array example.
The function CalcArrayMinMaxF32Cpp uses a simple for loop to determine the array's minimum and
maximum values. Prior to the for loop, the template function AlignedMem::IsAligned verifies that source
array x is properly aligned. You'll learn more about class AlignedMem in Chapter 7. The initial minimum
and maximum values are obtained from the global variables g_MinValInit and g_MaxValInit, which are
initialized to numeric_limits<float>::max() and -numeric_limits<float>::max(), respectively. Global
variables are employed here to ensure that the functions CalcArrayMinMaxF32Cpp and CalcArrayMinMaxF32_
use the same initial values.

Upon entry to the assembly language function CalcArrayMinMaxF32_, the array x is tested for proper
alignment. If array x is properly aligned, a vbroadcastss xmm4,real4 ptr [g_MinValInit] instruction
initializes all four single-precision floating-point elements in register XMM4 with the value g_MinValInit.
The subsequent vbroadcastss xmm5,real4 ptr [g_MaxValInit] instruction broadcasts g_MaxValInit to
all four element positions in register XMM5.

Like the previous example, the processing loop in CalcArrayMinMaxF32_ examines four array elements
during each iteration. The vminps xmm4,xmm4,xmm0 and vmaxps xmm5,xmm5,xmm0 instructions maintain
intermediate packed minimum and maximum values in registers XMM4 and XMM5, respectively. The
processing loop continues until there are fewer than four elements remaining. The final elements in the
array are tested using the scalar instructions vminss and vmaxss.
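
The packed loop and scalar tail described above can be sketched in portable C++. This is only an illustration under stated assumptions, not the book's code: the name CalcMinMaxLanes and the MinMax struct are hypothetical, the four-element arrays stand in for registers XMM4 and XMM5, and the initial values (the largest float and its negation) are assumed here rather than read from g_MinValInit and g_MaxValInit. The alignment test is omitted.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>

struct MinMax { float min_val; float max_val; };   // hypothetical result type

// Portable emulation of the packed min/max loop. lane_min and lane_max play
// the role of the four single-precision elements in XMM4 and XMM5.
MinMax CalcMinMaxLanes(const float* x, std::size_t n)
{
    assert(x != nullptr && n > 0);

    std::array<float, 4> lane_min, lane_max;
    lane_min.fill(std::numeric_limits<float>::max());    // assumed initial min
    lane_max.fill(-std::numeric_limits<float>::max());   // assumed initial max

    std::size_t i = 0;

    // Main loop: four elements per iteration (the role of vminps / vmaxps).
    for (; n - i >= 4; i += 4)
    {
        for (std::size_t k = 0; k < 4; k++)
        {
            lane_min[k] = std::min(lane_min[k], x[i + k]);
            lane_max[k] = std::max(lane_max[k], x[i + k]);
        }
    }

    // Tail: the final 1 - 3 elements update only the low-order lane
    // (the role of vminss / vmaxss).
    for (; i < n; i++)
    {
        lane_min[0] = std::min(lane_min[0], x[i]);
        lane_max[0] = std::max(lane_max[0], x[i]);
    }

    // Reduce the four lanes to a single minimum and maximum.
    MinMax r;
    r.min_val = *std::min_element(lane_min.begin(), lane_min.end());
    r.max_val = *std::max_element(lane_max.begin(), lane_max.end());
    return r;
}
```

Calling CalcMinMaxLanes with the 31-element test array of this example exercises both the four-wide loop (28 elements) and the scalar tail (3 elements).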

Subsequent to the execution of the vmaxss instruction that's immediately above the label SaveResults,
register XMM4 contains four single-precision floating-point values, and one of these values is the minimum
for array x. A series of vshufps (Packed Interleave Shuffle Single-Precision Floating-Point Values) and vminps
instructions is then used to determine the final minimum value. The vshufps xmm0,xmm4,xmm4,00001110b
instruction copies the two high-order floating-point elements in register XMM4 to the low-order element
positions in XMM0 (i.e., XMM0[63:0] = XMM4[127:64]). This instruction uses the bit values of its immediate
operand as indices for selecting elements to copy.

The immediate operand that's used by the vshufps instruction warrants further explanation. In the
current example, bits 1:0 (10b) of the immediate operand instruct the processor to copy single-precision
floating-point element #2 (XMM4[95:64]) from the first source operand to element position #0 (XMM0[31:0])
of the destination operand. Bits 3:2 (11b) of the immediate operand similarly instruct the processor to copy
element #3 (XMM4[127:96]) of the first source operand to element position #1 (XMM0[63:32]) of the
destination operand. Bits 5:4 and 7:6 of the immediate operand can be used to copy elements from the
second source operand to element positions #2 (XMM0[95:64]) and #3 (XMM0[127:96]) of the destination
operand, but they're not needed in the current example. The vshufps instruction is followed by a vminps
xmm1,xmm0,xmm4 instruction that yields the final two minimum values in XMM1[63:32] and XMM1[31:0].
Another sequence of vshufps and vminps instructions is then used to extract the final minimum value.
Figure 6-3 illustrates this reduction process in greater detail.


[Figure 6-3 shows the initial values in XMM4 and the contents of XMM0 - XMM3 after each reduction step:
vshufps xmm0,xmm4,xmm4,00001110b; vminps xmm1,xmm0,xmm4; vshufps xmm2,xmm1,xmm1,00000001b;
vminps xmm3,xmm2,xmm1. Shaded elements denote don't care values.]


Figure 6-3. Packed minimum reduction using vshufps and vminps instructions
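
The element selection performed by the immediate operand, and the two-step reduction that follows, can be mimicked in plain C++. The sketch below is an emulation for study purposes only; the helper names ShufPs, MinPs, and ReduceMin are hypothetical, element 0 of each array corresponds to bits 31:0 of the register, and the NaN handling of the real vminps instruction is not modeled.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;   // element 0 = bits 31:0 of an XMM register

// Emulates vshufps des,src1,src2,imm8 for 128-bit operands: destination
// elements 0 and 1 are selected from src1, elements 2 and 3 from src2.
Vec4 ShufPs(const Vec4& src1, const Vec4& src2, unsigned imm8)
{
    assert(imm8 <= 0xFF);
    Vec4 des;
    des[0] = src1[imm8 & 3];           // imm8 bits 1:0
    des[1] = src1[(imm8 >> 2) & 3];    // imm8 bits 3:2
    des[2] = src2[(imm8 >> 4) & 3];    // imm8 bits 5:4
    des[3] = src2[(imm8 >> 6) & 3];    // imm8 bits 7:6
    return des;
}

// Element-wise minimum (the role of vminps, ignoring NaN semantics).
Vec4 MinPs(const Vec4& a, const Vec4& b)
{
    Vec4 des;
    for (int k = 0; k < 4; k++)
        des[k] = std::min(a[k], b[k]);
    return des;
}

// Same four-step sequence that follows the label SaveResults.
float ReduceMin(const Vec4& v4)
{
    Vec4 v0 = ShufPs(v4, v4, 0x0E);    // 00001110b: v0[1:0] = v4[3:2]
    Vec4 v1 = MinPs(v0, v4);           // v1[1:0] hold the final two candidates
    Vec4 v2 = ShufPs(v1, v1, 0x01);    // 00000001b: v2[0] = v1[1]
    Vec4 v3 = MinPs(v2, v1);           // v3[0] holds the final minimum
    return v3[0];
}
```

Replacing std::min with std::max in MinPs gives the analogous maximum reduction.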


Following calculation of the array's minimum value, an analogous series of vshufps and vmaxps
instructions determines the maximum value using the same reduction technique. Here are the results for
example Ch06_05:


-------------- Array x --------------
x[ 0]: 2183.0  x[ 1]: 4547.0
x[ 2]: 9279.0  x[ 3]: 7291.0
x[ 4]: 5105.0  x[ 5]: 6505.0
x[ 6]: 4820.0  x[ 7]:  994.0
x[ 8]: 1559.0  x[ 9]: 3867.0
x[10]: 7272.0  x[11]: 9698.0
x[12]: 6181.0  x[13]: 4742.0
x[14]: 7279.0  x[15]: 1224.0
x[16]: 4840.0  x[17]: 8453.0
x[18]: 6876.0  x[19]: 1786.0
x[20]: 4022.0  x[21]:  911.0
x[22]: 6676.0  x[23]: 2979.0
x[24]: 4431.0  x[25]: 6133.0
x[26]: 7093.0  x[27]: 9892.0
x[28]: 9622.0  x[29]: 5058.0
x[30]: 1264.0

Results for CalcArrayMinMaxF32Cpp
  min_val = 911.0, max_val = 9892.0

Results for CalcArrayMinMaxF32_
  min_val = 911.0, max_val = 9892.0


Packed Floating-Point Least Squares 


Source code example Ch06_06 details the calculation of a least squares regression line using packed
double-precision floating-point arithmetic. Listing 6-6 shows the C++ and x86 assembly language source
code for example Ch06_06.


Listing 6-6. Example Ch06_06 


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <cstddef>
#include <cmath>
#include "AlignedMem.h"

using namespace std;

extern "C" double LsEpsilon = 1.0e-12;
extern "C" bool AvxCalcLeastSquares_(const double* x, const double* y, int n, double* m, double* b);

bool AvxCalcLeastSquaresCpp(const double* x, const double* y, int n, double* m, double* b)
{
    if (n < 2)
        return false;

    if (!AlignedMem::IsAligned(x, 16) || !AlignedMem::IsAligned(y, 16))
        return false;

    double sum_x = 0, sum_y = 0.0, sum_xx = 0, sum_xy = 0.0;

    for (int i = 0; i < n; i++)
    {
        sum_x += x[i];
        sum_xx += x[i] * x[i];
        sum_xy += x[i] * y[i];
        sum_y += y[i];
    }

    double denom = n * sum_xx - sum_x * sum_x;

    if (fabs(denom) >= LsEpsilon)
    {
        *m = (n * sum_xy - sum_x * sum_y) / denom;
        *b = (sum_xx * sum_y - sum_x * sum_xy) / denom;
        return true;
    }
    else
    {
        *m = *b = 0.0;
        return false;
    }
}

int main()
{
    const int n = 11;
    alignas(16) double x[n] = {10, 13, 17, 19, 23, 7, 35, 51, 89, 92, 99};
    alignas(16) double y[n] = {1.2, 1.1, 1.8, 2.2, 1.9, 0.5, 3.1, 5.5, 8.4, 9.7, 10.4};

    double m1 = 0, m2 = 0;
    double b1 = 0, b2 = 0;

    bool rc1 = AvxCalcLeastSquaresCpp(x, y, n, &m1, &b1);
    bool rc2 = AvxCalcLeastSquares_(x, y, n, &m2, &b2);

    cout << fixed << setprecision(8);

    cout << "\nResults from AvxCalcLeastSquaresCpp\n";
    cout << " rc: " << setw(12) << boolalpha << rc1 << '\n';
    cout << " slope: " << setw(12) << m1 << '\n';
    cout << " intercept:: " << setw(12) << b1 << '\n';

    cout << "\nResults from AvxCalcLeastSquares_\n";
    cout << " rc: " << setw(12) << boolalpha << rc2 << '\n';
    cout << " slope: " << setw(12) << m2 << '\n';
    cout << " intercept:: " << setw(12) << b2 << '\n';
    return 0;
}
;-------------------------------------------------
;               Ch06_06.asm
;-------------------------------------------------

        include <MacrosX86-64-AVX.asmh>

        extern LsEpsilon:real8              ;global value defined in C++ file

; extern "C" bool AvxCalcLeastSquares_(const double* x, const double* y, int n, double* m,
;   double* b);
;
; Returns       0 = error (invalid n or improperly aligned array), 1 = success

        .const
        align 16
AbsMaskF64 qword 7fffffffffffffffh, 7fffffffffffffffh   ;mask for DPFP absolute value

        .code
AvxCalcLeastSquares_ proc frame
        _CreateFrame LS_,0,48,rbx
        _SaveXmmRegs xmm6,xmm7,xmm8
        _EndProlog

; Validate arguments
        xor eax,eax                         ;set error return code
        cmp r8d,2
        jl Done                             ;jump if n < 2
        test rcx,0fh
        jnz Done                            ;jump if x not aligned to 16-byte boundary
        test rdx,0fh
        jnz Done                            ;jump if y not aligned to 16-byte boundary

; Perform required initializations
        vcvtsi2sd xmm3,xmm3,r8d             ;xmm3 = n
        mov eax,r8d
        and r8d,0fffffffeh                  ;r8d = n / 2 * 2
        and eax,1                           ;eax = n % 2

        vxorpd xmm4,xmm4,xmm4               ;sum_x (both qwords)
        vxorpd xmm5,xmm5,xmm5               ;sum_y (both qwords)
        vxorpd xmm6,xmm6,xmm6               ;sum_xx (both qwords)
        vxorpd xmm7,xmm7,xmm7               ;sum_xy (both qwords)

        xor ebx,ebx                         ;rbx = array offset
        mov r10,[rbp+LS_OffsetStackArgs]    ;r10 = b

; Calculate sum variables. Note that two values are processed each iteration.
@@:     vmovapd xmm0,xmmword ptr [rcx+rbx]  ;load next two x values
        vmovapd xmm1,xmmword ptr [rdx+rbx]  ;load next two y values

        vaddpd xmm4,xmm4,xmm0               ;update sum_x
        vaddpd xmm5,xmm5,xmm1               ;update sum_y

        vmulpd xmm2,xmm0,xmm0               ;calc x * x
        vaddpd xmm6,xmm6,xmm2               ;update sum_xx

        vmulpd xmm2,xmm0,xmm1               ;calc x * y
        vaddpd xmm7,xmm7,xmm2               ;update sum_xy

        add rbx,16                          ;rbx = next offset
        sub r8d,2                           ;adjust counter
        jnz @B                              ;repeat until done


; Update sum variables with the final x, y values if 'n' is odd
        or eax,eax
        jz CalcFinalSums                    ;jump if n is even
        vmovsd xmm0,real8 ptr [rcx+rbx]     ;load final x
        vmovsd xmm1,real8 ptr [rdx+rbx]     ;load final y

        vaddsd xmm4,xmm4,xmm0               ;update sum_x
        vaddsd xmm5,xmm5,xmm1               ;update sum_y
        vmulsd xmm2,xmm0,xmm0               ;calc x * x
        vaddsd xmm6,xmm6,xmm2               ;update sum_xx
        vmulsd xmm2,xmm0,xmm1               ;calc x * y
        vaddsd xmm7,xmm7,xmm2               ;update sum_xy

; Calculate final sum_x, sum_y, sum_xx, sum_xy
CalcFinalSums:
        vhaddpd xmm4,xmm4,xmm4              ;xmm4[63:0] = final sum_x
        vhaddpd xmm5,xmm5,xmm5              ;xmm5[63:0] = final sum_y
        vhaddpd xmm6,xmm6,xmm6              ;xmm6[63:0] = final sum_xx
        vhaddpd xmm7,xmm7,xmm7              ;xmm7[63:0] = final sum_xy

; Compute denominator and make sure it's valid
; denom = n * sum_xx - sum_x * sum_x
        vmulsd xmm0,xmm3,xmm6               ;n * sum_xx
        vmulsd xmm1,xmm4,xmm4               ;sum_x * sum_x
        vsubsd xmm2,xmm0,xmm1               ;denom
        vandpd xmm8,xmm2,xmmword ptr [AbsMaskF64]   ;fabs(denom)
        vcomisd xmm8,real8 ptr [LsEpsilon]
        jb BadDen                           ;jump if fabs(denom) < LsEpsilon

; Compute and save slope
; slope = (n * sum_xy - sum_x * sum_y) / denom
        vmulsd xmm0,xmm3,xmm7               ;n * sum_xy
        vmulsd xmm1,xmm4,xmm5               ;sum_x * sum_y
        vsubsd xmm2,xmm0,xmm1               ;slope numerator
        vdivsd xmm3,xmm2,xmm8               ;final slope (divisor is fabs(denom) in xmm8)
        vmovsd real8 ptr [r9],xmm3          ;save slope

; Compute and save intercept
; intercept = (sum_xx * sum_y - sum_x * sum_xy) / denom
        vmulsd xmm0,xmm6,xmm5               ;sum_xx * sum_y
        vmulsd xmm1,xmm4,xmm7               ;sum_x * sum_xy
        vsubsd xmm2,xmm0,xmm1               ;intercept numerator
        vdivsd xmm3,xmm2,xmm8               ;final intercept
        vmovsd real8 ptr [r10],xmm3         ;save intercept

        mov eax,1                           ;success return code
        jmp Done

; Bad denominator detected, set m and b to 0.0
BadDen: vxorpd xmm0,xmm0,xmm0
        vmovsd real8 ptr [r9],xmm0          ;*m = 0.0
        vmovsd real8 ptr [r10],xmm0         ;*b = 0.0
        xor eax,eax                         ;set error code

Done:   _RestoreXmmRegs xmm6,xmm7,xmm8
        _DeleteFrame rbx
        ret
AvxCalcLeastSquares_ endp
        end


Simple linear regression is a statistical technique that models a linear relationship between two
variables. One popular method of simple linear regression is called least squares fitting, which uses a set of
sample data points to determine a best fit or optimal curve between two variables. When used with a simple
linear regression model, the curve is a straight line whose equation is y = mx + b. In this equation, x denotes
the independent variable, y represents the dependent (or measured) variable, m is the line's slope, and b is
the line's y-axis intercept point. The slope and intercept point of a least squares line are determined using a
series of computations that minimize the sum of the squared deviations between the line and sample data
points. Following calculation of its slope and intercept point, a least squares line is frequently used to predict
an unknown y value using a known x value. If you're interested in learning more about the theory of simple
linear regression and least squares fitting, consult the references listed in Appendix A.

In sample program Ch06_06, the following equations are used to calculate the least squares slope and
intercept point:


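$$m = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left( \sum_i x_i \right)^2}$$

$$b = \frac{\sum_i x_i^2 \sum_i y_i - \sum_i x_i \sum_i x_i y_i}{n \sum_i x_i^2 - \left( \sum_i x_i \right)^2}$$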


At first glance, the slope and intercept equations may appear a little daunting. However, upon closer 
examination, a couple of simplifications become apparent. First, the slope and intercept point denominators 
are the same, which means that this value only needs to be computed once. Second, it is only necessary to 
calculate four simple summation quantities (or sum variables), as shown in the following equations: 


$$\text{sum\_x} = \sum_i x_i \qquad \text{sum\_y} = \sum_i y_i$$

$$\text{sum\_xy} = \sum_i x_i y_i \qquad \text{sum\_xx} = \sum_i x_i^2$$

Subsequent to the calculation of the sum variables, the least-squares slope and intercept point are easily
derived using straightforward multiplication, subtraction, and division.

The C++ source code in Listing 6-6 includes a function named AvxCalcLeastSquaresCpp that calculates
a least-squares slope and intercept point for comparison purposes. AvxCalcLeastSquaresCpp uses
AlignedMem::IsAligned() to validate proper alignment of the two data arrays. The C++ class AlignedMem
(source code not shown but included in the download package) contains a few simple member functions
that perform aligned memory management and validation. These functions have been incorporated into a
C++ class to facilitate code reuse in this example and subsequent chapters. The C++ function main defines
a couple of test arrays named x and y using the C++ specifier alignas(16), which instructs the compiler to
align each of these arrays on a 16-byte boundary. The remainder of main contains code that exercises both
the C++ and x86 assembly language implementations of the least squares algorithm and streams the results
to cout.

The x86-64 assembly language code for function AvxCalcLeastSquares_ begins with saves of
non-volatile registers RBX, XMM6, XMM7, and XMM8 using the macros _CreateFrame and _SaveXmmRegs.
Argument value n is then validated for size, and the array pointers x and y are tested for proper alignment.
Following validation of the function arguments, a series of initializations is performed. The vcvtsi2sd
xmm3,xmm3,r8d instruction converts the value n to double-precision floating-point for later use. The value
n in R8D is then rounded down to the nearest even number using an and r8d,0fffffffeh instruction and
EAX is set to zero or one depending on whether the original value of n is even or odd. These adjustments are
carried out to ensure proper processing of arrays x and y using packed arithmetic.

Recall from the discussions earlier in this section that in order to compute the slope and intercept
point of a least squares regression line, you need to calculate four intermediate sum values: sum_x, sum_y,
sum_xx, and sum_xy. The summation loop that calculates these values in AvxCalcLeastSquares_ uses
packed double-precision floating-point arithmetic. This means that AvxCalcLeastSquares_ can process two
elements from arrays x and y during each loop iteration, which halves the number of required iterations. The
sum values for array elements with even-numbered indices are computed using the low-order quadwords of
XMM4 - XMM7, while the high-order quadwords are used to calculate the sum values for array elements with
odd-numbered indices.

Prior to entering the summation loop, each sum value register is initialized to zero using a vxorpd
instruction. At the top of the summation loop, a vmovapd xmm0,xmmword ptr [rcx+rbx] instruction copies
x[i] and x[i+1] into the low-order and high-order quadwords of XMM0, respectively. The next instruction,
vmovapd xmm1,xmmword ptr [rdx+rbx], loads y[i] and y[i+1] into the low-order and high-order
quadwords of XMM1. A series of vaddpd and vmulpd instructions update the packed sum values that are
maintained in XMM4 - XMM7. Array offset register RBX is then incremented by 16 (or the size of two
double-precision floating-point values) and the count value in R8D is adjusted before the next summation loop
iteration. Following completion of the summation loop, a check is made to determine if the original value of
n was odd. If true, the final element of array x and array y must be added to the packed sum values. The AVX
scalar instructions vaddsd and vmulsd carry out this operation.

Following computation of the packed sum values, a series of vhaddpd (Packed Double-FP Horizontal
Add) instructions compute the final values of sum_x, sum_y, sum_xx, and sum_xy. Each vhaddpd
DesOp,SrcOp1,SrcOp2 instruction computes DesOp[63:0] = SrcOp1[127:64] + SrcOp1[63:0] and
DesOp[127:64] = SrcOp2[127:64] + SrcOp2[63:0] (see Figure 4-14). Subsequent to the execution of the
vhaddpd instructions, the low-order quadwords of registers XMM4 - XMM7 contain the final sum values. The
high-order quadwords of these registers also contain the final sum values, but this is a consequence of using
the same register for both source operands. The value of denom is computed next and tested to make sure its
absolute value is greater than or equal to LsEpsilon; an absolute value less than LsEpsilon is considered
too close to zero to be valid. Note that a vandpd instruction is used to calculate fabs(denom). After validation
of denom, the slope and intercept values are calculated using straightforward scalar arithmetic. Here is the
output for source code example Ch06_06:


Results from AvxCalcLeastSquaresCpp
 rc:         true
 slope:   0.10324631
 intercept::  -0.10700632

Results from AvxCalcLeastSquares_
 rc:         true
 slope:   0.10324631
 intercept::  -0.10700632
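
The lane-wise summation used by the assembly language function can also be sketched in portable C++. The code below is an emulation under stated assumptions rather than the book's implementation: the function name LeastSquaresLanes and its epsilon parameter are hypothetical, each two-element array stands in for the low-order and high-order quadwords of one XMM register, and the 16-byte alignment checks are omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Emulates the packed summation: index 0 of each sum array plays the role of
// the low-order quadword (even indices), index 1 the high-order quadword
// (odd indices).
bool LeastSquaresLanes(const double* x, const double* y, int n,
                       double* m, double* b, double epsilon = 1.0e-12)
{
    assert(x != nullptr && y != nullptr && m != nullptr && b != nullptr);

    if (n < 2)
        return false;

    const int n_even = n & ~1;            // like and r8d,0fffffffeh
    const bool n_is_odd = (n & 1) != 0;   // like and eax,1

    double sum_x[2] = {0.0, 0.0}, sum_y[2] = {0.0, 0.0};
    double sum_xx[2] = {0.0, 0.0}, sum_xy[2] = {0.0, 0.0};

    // Packed loop: two elements per iteration (vaddpd / vmulpd).
    for (int i = 0; i < n_even; i += 2)
    {
        for (int k = 0; k < 2; k++)
        {
            sum_x[k] += x[i + k];
            sum_y[k] += y[i + k];
            sum_xx[k] += x[i + k] * x[i + k];
            sum_xy[k] += x[i + k] * y[i + k];
        }
    }

    // Odd n: the final element updates only the low lane (vaddsd / vmulsd).
    if (n_is_odd)
    {
        sum_x[0] += x[n - 1];
        sum_y[0] += y[n - 1];
        sum_xx[0] += x[n - 1] * x[n - 1];
        sum_xy[0] += x[n - 1] * y[n - 1];
    }

    // Horizontal adds (vhaddpd) combine the two lanes.
    const double sx = sum_x[0] + sum_x[1];
    const double sy = sum_y[0] + sum_y[1];
    const double sxx = sum_xx[0] + sum_xx[1];
    const double sxy = sum_xy[0] + sum_xy[1];

    const double denom = n * sxx - sx * sx;

    if (std::fabs(denom) < epsilon)
    {
        *m = *b = 0.0;
        return false;
    }

    *m = (n * sxy - sx * sy) / denom;
    *b = (sxx * sy - sx * sxy) / denom;
    return true;
}
```

Because the even-indexed and odd-indexed elements are accumulated separately, the order of the floating-point additions differs from the simple for loop in AvxCalcLeastSquaresCpp, so the two approaches can differ in the last few bits of the result.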


Packed Floating-Point Matrices 


Software applications such as computer graphics and computer-aided design programs often make 
extensive use of matrices. For example, three-dimensional (3D) computer graphics software typically 
employs matrices to perform common transformations such as translation, scaling, and rotation. When 
using homogeneous coordinates, each of these operations can be efficiently represented using a single 4 x 4 
matrix. Multiple transformations can also be applied by merging a series of distinct transformation matrices 
into a single transformation matrix using matrix multiplication. This combined matrix is typically applied to 
an array of object vertices that defines a 3D model. It is important for 3D computer graphics software to carry 
out operations such as matrix multiplication and matrix-vector multiplication as quickly as possible since a 
3D model may contain thousands or even millions of object vertices. 

In this section, you learn how to perform matrix transposition and multiplication using 4 x 4 matrices 
and the AVX instruction set. You also learn more about assembly language macros, how to write macro code, 
and some simple techniques for benchmarking algorithm performance. 


Matrix Transposition 


The transpose of a matrix is calculated by interchanging its rows and columns. More formally, if A is an m x n
matrix, the transpose of A (denoted here by B) is an n x m matrix, where b(i,j) = a(j,i). Figure 6-4 illustrates
the transposition of a 4 x 4 matrix.
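
The definition b(i,j) = a(j,i) translates directly into a pair of nested loops. The following sketch assumes row-major storage in a std::vector; the function name Transpose is chosen here for illustration and is unrelated to the Matrix class used by the example code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the n x m transpose of the m x n row-major matrix a,
// using the definition b(i,j) = a(j,i).
std::vector<float> Transpose(const std::vector<float>& a, std::size_t m, std::size_t n)
{
    assert(a.size() == m * n);

    std::vector<float> b(n * m);

    for (std::size_t i = 0; i < n; i++)         // rows of b
    {
        for (std::size_t j = 0; j < m; j++)     // columns of b
            b[i * m + j] = a[j * n + i];
    }

    return b;
}
```

For a 4 x 4 matrix both m and n are 4, and the loops simply swap the row and column index of every element.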


         Matrix A                  Transpose of Matrix A

     |  2   7   8   3 |            |  2  11  24  31 |
A =  | 11  14  16  10 |       B =  |  7  14  21  34 |
     | 24  21  27  29 |            |  8  16  27  38 |
     | 31  34  38  33 |            |  3  10  29  33 |


Figure 6-4. Transposition of a 4 x 4 matrix

Listing 6-7 shows the source code for example Ch06_07, which demonstrates how to transpose a 4 x 4
matrix of single-precision floating-point values.


Listing 6-7. Example Ch06_07


#pragma once

// Ch06_07_.asm
extern "C" void AvxMat4x4TransposeF32_(float* m_des, const float* m_src);

// Ch06_07_BM.cpp
extern void AvxMat4x4TransposeF32_BM(void);


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include "Ch06_07.h"
#include "Matrix.h"

using namespace std;

void AvxMat4x4TransposeF32(Matrix<float>& m_src)
{
    const size_t nr = 4;
    const size_t nc = 4;
    Matrix<float> m_des1(nr, nc);
    Matrix<float> m_des2(nr, nc);

    Matrix<float>::Transpose(m_des1, m_src);
    AvxMat4x4TransposeF32_(m_des2.Data(), m_src.Data());

    cout << fixed << setprecision(1);
    m_src.SetOstream(12, " ");
    m_des1.SetOstream(12, " ");
    m_des2.SetOstream(12, " ");

    cout << "Results for AvxMat4x4TransposeF32\n";
    cout << "Matrix m_src\n" << m_src << '\n';
    cout << "Matrix m_des1\n" << m_des1 << '\n';
    cout << "Matrix m_des2\n" << m_des2 << '\n';

    if (m_des1 != m_des2)
        cout << "\nMatrix compare failed - AvxMat4x4TransposeF32\n";
}

int main()
{
    const size_t nr = 4;
    const size_t nc = 4;
    Matrix<float> m_src(nr, nc);

    const float src_row0[] = { 2, 7, 8, 3 };
    const float src_row1[] = { 11, 14, 16, 10 };
    const float src_row2[] = { 24, 21, 27, 29 };
    const float src_row3[] = { 31, 34, 38, 33 };

    m_src.SetRow(0, src_row0);
    m_src.SetRow(1, src_row1);
    m_src.SetRow(2, src_row2);
    m_src.SetRow(3, src_row3);

    AvxMat4x4TransposeF32(m_src);
    AvxMat4x4TransposeF32_BM();
    return 0;
}


        include <MacrosX86-64-AVX.asmh>

; _Mat4x4TransposeF32 macro
;
; Description:  This macro transposes a 4x4 matrix of single-precision
;               floating-point values.
;
;  Input Matrix                    Output Matrix
;  xmm0    a3 a2 a1 a0             xmm4    d0 c0 b0 a0
;  xmm1    b3 b2 b1 b0             xmm5    d1 c1 b1 a1
;  xmm2    c3 c2 c1 c0             xmm6    d2 c2 b2 a2
;  xmm3    d3 d2 d1 d0             xmm7    d3 c3 b3 a3

_Mat4x4TransposeF32 macro
        vunpcklps xmm6,xmm0,xmm1            ;xmm6 = b1 a1 b0 a0
        vunpckhps xmm0,xmm0,xmm1            ;xmm0 = b3 a3 b2 a2
        vunpcklps xmm7,xmm2,xmm3            ;xmm7 = d1 c1 d0 c0
        vunpckhps xmm1,xmm2,xmm3            ;xmm1 = d3 c3 d2 c2

        vmovlhps xmm4,xmm6,xmm7             ;xmm4 = d0 c0 b0 a0
        vmovhlps xmm5,xmm7,xmm6             ;xmm5 = d1 c1 b1 a1
        vmovlhps xmm6,xmm0,xmm1             ;xmm6 = d2 c2 b2 a2
        vmovhlps xmm7,xmm1,xmm0             ;xmm7 = d3 c3 b3 a3
        endm

; extern "C" void AvxMat4x4TransposeF32_(float* m_des, const float* m_src)

        .code
AvxMat4x4TransposeF32_ proc frame
        _CreateFrame MT_,0,32
        _SaveXmmRegs xmm6,xmm7
        _EndProlog

; Transpose matrix m_src
        vmovaps xmm0,[rdx]                  ;xmm0 = m_src.row_0
        vmovaps xmm1,[rdx+16]               ;xmm1 = m_src.row_1
        vmovaps xmm2,[rdx+32]               ;xmm2 = m_src.row_2
        vmovaps xmm3,[rdx+48]               ;xmm3 = m_src.row_3

        _Mat4x4TransposeF32

        vmovaps [rcx],xmm4                  ;save m_des.row_0
        vmovaps [rcx+16],xmm5               ;save m_des.row_1
        vmovaps [rcx+32],xmm6               ;save m_des.row_2
        vmovaps [rcx+48],xmm7               ;save m_des.row_3

Done:   _RestoreXmmRegs xmm6,xmm7
        _DeleteFrame
        ret
AvxMat4x4TransposeF32_ endp
        end
//------------------------------------------------
//               Ch06_07_BM.cpp
//------------------------------------------------

#include "stdafx.h"
#include <iostream>
#include <string>
#include "Ch06_07.h"
#include "Matrix.h"
#include "BmThreadTimer.h"
#include "OS.h"

using namespace std;

extern void AvxMat4x4TransposeF32_BM(void)
{
    OS::SetThreadAffinityMask();

    cout << "\nRunning benchmark function AvxMat4x4TransposeF32_BM - please wait\n";

    const size_t num_rows = 4;
    const size_t num_cols = 4;
    Matrix<float> m_src(num_rows, num_cols);
    Matrix<float> m_des1(num_rows, num_cols);
    Matrix<float> m_des2(num_rows, num_cols);

    const float m_src_r0[] = { 10, 11, 12, 13 };
    const float m_src_r1[] = { 14, 15, 16, 17 };
    const float m_src_r2[] = { 18, 19, 20, 21 };
    const float m_src_r3[] = { 22, 23, 24, 25 };

    m_src.SetRow(0, m_src_r0);
    m_src.SetRow(1, m_src_r1);
    m_src.SetRow(2, m_src_r2);
    m_src.SetRow(3, m_src_r3);

    const size_t num_it = 500;
    const size_t num_alg = 2;
    const size_t num_ops = 1000000;

    BmThreadTimer bmtt(num_it, num_alg);

    for (size_t i = 0; i < num_it; i++)
    {
        bmtt.Start(i, 0);
        for (size_t j = 0; j < num_ops; j++)
            Matrix<float>::Transpose(m_des1, m_src);
        bmtt.Stop(i, 0);

        bmtt.Start(i, 1);
        for (size_t j = 0; j < num_ops; j++)
            AvxMat4x4TransposeF32_(m_des2.Data(), m_src.Data());
        bmtt.Stop(i, 1);
    }

    string fn = bmtt.BuildCsvFilenameString("Ch06_07_AvxMat4x4TransposeF32_BM");
    bmtt.SaveElapsedTimes(fn, BmThreadTimer::EtUnit::MicroSec, 2);
    cout << "Benchmark times save to file " << fn << '\n';
}

The function main begins by instantiating a 4 x 4 single-precision floating-point test matrix named
m_src using the C++ template Matrix. This template, which is defined in the header file Matrix.h (source
code not shown), contains C++ code that implements a simple matrix class for test and benchmarking
purposes. The internal buffer allocated by Matrix is aligned on a 64-byte boundary, which means that
objects of type Matrix are properly aligned for use with AVX, AVX2, and AVX-512 instructions. The function
main calls AvxMat4x4TransposeF32, which exercises the matrix transposition functions written in C++ and
assembly language. The results of these transpositions are then streamed to cout. The function main also
invokes a benchmarking function named AvxMat4x4TransposeF32_BM that measures the performance of
each transposition function as explained later in this section.

Near the top of the assembly language code is a macro named _Mat4x4TransposeF32. You learned in
Chapter 5 that a macro is an assembler text substitution mechanism that allows a single text string to
represent a sequence of assembly language instructions, data definitions, or other statements. During
assembly of an x86 assembly language source code file, the assembler replaces any occurrence of the macro
name with the statements that are declared between the macro and endm directives. Assembly language
macros are typically employed to generate sequences of instructions that will be used more than once.
Macros are also frequently used to avoid the performance overhead of a function call.

The macro _Mat4x4TransposeF32 contains AVX instructions that transpose a 4 x 4 matrix of
single-precision floating-point values. This macro requires the rows of the source matrix to be loaded into
registers XMM0 - XMM3 prior to its use. It then employs a series of vunpcklps, vunpckhps, vmovlhps, and
vmovhlps instructions to transpose the source matrix, as illustrated in Figure 6-5. Following execution of
these instructions, the transposed matrix is stored in registers XMM4 - XMM7.


[Figure 6-5 traces the source matrix rows in XMM0 - XMM3 (2 7 8 3 / 11 14 16 10 / 24 21 27 29 /
31 34 38 33) through the instructions vunpcklps xmm6,xmm0,xmm1; vunpckhps xmm0,xmm0,xmm1;
vunpcklps xmm7,xmm2,xmm3; vunpckhps xmm1,xmm2,xmm3; vmovlhps xmm4,xmm6,xmm7;
vmovhlps xmm5,xmm7,xmm6; vmovlhps xmm6,xmm0,xmm1; vmovhlps xmm7,xmm1,xmm0 to the transposed
rows in XMM4 - XMM7 (2 11 24 31 / 7 14 21 34 / 8 16 27 38 / 3 10 29 33).]


Figure 6-5. Instruction sequence used by _Mat4x4TransposeF32 to transpose a 4 x 4 matrix of single-precision
floating-point values
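
To see why this particular sequence of unpack and move instructions yields a transpose, it can help to emulate each instruction on four-element arrays. The sketch below is a study aid written for this purpose, not the book's code; the helper names (UnpckLps, UnpckHps, MovLhps, MovHlps, Transpose4x4) are hypothetical, and element 0 of each array corresponds to bits 31:0 of the register.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;   // element 0 = bits 31:0 of an XMM register

// vunpcklps: interleave the two low-order elements of a and b
Vec4 UnpckLps(const Vec4& a, const Vec4& b) { return Vec4{{ a[0], b[0], a[1], b[1] }}; }

// vunpckhps: interleave the two high-order elements of a and b
Vec4 UnpckHps(const Vec4& a, const Vec4& b) { return Vec4{{ a[2], b[2], a[3], b[3] }}; }

// vmovlhps des,a,b: des[63:0] = a[63:0], des[127:64] = b[63:0]
Vec4 MovLhps(const Vec4& a, const Vec4& b) { return Vec4{{ a[0], a[1], b[0], b[1] }}; }

// vmovhlps des,a,b: des[63:0] = b[127:64], des[127:64] = a[127:64]
Vec4 MovHlps(const Vec4& a, const Vec4& b) { return Vec4{{ b[2], b[3], a[2], a[3] }}; }

// Same eight-step sequence as the macro; rows[0..3] stand in for XMM0 - XMM3
// and the result rows for XMM4 - XMM7. Temporaries replace register reuse.
std::array<Vec4, 4> Transpose4x4(const std::array<Vec4, 4>& rows)
{
    const Vec4 t0 = UnpckLps(rows[0], rows[1]);   // b1 a1 b0 a0
    const Vec4 t1 = UnpckHps(rows[0], rows[1]);   // b3 a3 b2 a2
    const Vec4 t2 = UnpckLps(rows[2], rows[3]);   // d1 c1 d0 c0
    const Vec4 t3 = UnpckHps(rows[2], rows[3]);   // d3 c3 d2 c2

    std::array<Vec4, 4> out;
    out[0] = MovLhps(t0, t2);                     // d0 c0 b0 a0
    out[1] = MovHlps(t2, t0);                     // d1 c1 b1 a1
    out[2] = MovLhps(t1, t3);                     // d2 c2 b2 a2
    out[3] = MovHlps(t3, t1);                     // d3 c3 b3 a3

    assert(out[0][1] == rows[1][0]);              // spot check: b(0,1) == a(1,0)
    return out;
}
```

The unpack steps pair up elements from adjacent rows, and the move steps then join the matching 64-bit halves, so each output row collects one column of the input.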


The macro _Mat4x4TransposeF32 is used by the assembly language function AvxMat4x4TransposeF32_.
Immediately following its function prolog, function AvxMat4x4TransposeF32_ executes a series of vmovaps
instructions to load the source matrix into registers XMM0 - XMM3. Each XMM register contains one row
of the source matrix. The macro _Mat4x4TransposeF32 is then employed to transpose the matrix. Figure 6-6
contains an excerpt from the MASM listing file that shows the macro expansion of _Mat4x4TransposeF32.
This figure also shows the expansions of the prolog and epilog macros. The listing file symbolizes macro
expanded instructions by placing a 1 in a column that's located to the left of the mnemonic. Following
calculation of the transpose, the resultant matrix is saved to the destination buffer using another series of
vmovaps instructions.


[Listing file excerpt; the address and machine-code columns of the original figure are not reproduced.]

AvxMat4x4TransposeF32_ proc frame
        _CreateFrame MT_,0,32
     1  push rbp
     1  sub rsp,StackSizeTotal
     1  lea rbp,[rsp+32]
        _SaveXmmRegs xmm6,xmm7
     1  vmovdqa xmmword ptr [rbp-ValNameOffsetSaveXmmRegs],xmm6
     1  vmovdqa xmmword ptr [rbp-ValNameOffsetSaveXmmRegs+16],xmm7
        _EndProlog

; Transpose matrix m_src
        vmovaps xmm0,[rdx]                  ;xmm0 = m_src.row_0
        vmovaps xmm1,[rdx+16]               ;xmm1 = m_src.row_1
        vmovaps xmm2,[rdx+32]               ;xmm2 = m_src.row_2
        vmovaps xmm3,[rdx+48]               ;xmm3 = m_src.row_3
        _Mat4x4TransposeF32
     1  vunpcklps xmm6,xmm0,xmm1            ;xmm6 = b1 a1 b0 a0
     1  vunpckhps xmm0,xmm0,xmm1            ;xmm0 = b3 a3 b2 a2
     1  vunpcklps xmm7,xmm2,xmm3            ;xmm7 = d1 c1 d0 c0
     1  vunpckhps xmm1,xmm2,xmm3            ;xmm1 = d3 c3 d2 c2
     1  vmovlhps xmm4,xmm6,xmm7             ;xmm4 = d0 c0 b0 a0
     1  vmovhlps xmm5,xmm7,xmm6             ;xmm5 = d1 c1 b1 a1
     1  vmovlhps xmm6,xmm0,xmm1             ;xmm6 = d2 c2 b2 a2
     1  vmovhlps xmm7,xmm1,xmm0             ;xmm7 = d3 c3 b3 a3
        vmovaps [rcx],xmm4                  ;save m_des.row_0
        vmovaps [rcx+16],xmm5               ;save m_des.row_1
        vmovaps [rcx+32],xmm6               ;save m_des.row_2
        vmovaps [rcx+48],xmm7               ;save m_des.row_3
Done:   _RestoreXmmRegs xmm6,xmm7
     1  vmovdqa xmm6,xmmword ptr [rbp-ValNameOffsetSaveXmmRegs]
     1  vmovdqa xmm7,xmmword ptr [rbp-ValNameOffsetSaveXmmRegs+16]
        _DeleteFrame
     1  mov rsp,rbp
     1  pop rbp
        ret
AvxMat4x4TransposeF32_ endp
        end


Figure 6-6. Expansion of macro _Mat4x4TransposeF32

Here is the output for source code example Ch06_07:


Results for AvxMat4x4TransposeF32
Matrix m_src
         2.0          7.0          8.0          3.0
        11.0         14.0         16.0         10.0
        24.0         21.0         27.0         29.0
        31.0         34.0         38.0         33.0

Matrix m_des1
         2.0         11.0         24.0         31.0
         7.0         14.0         21.0         34.0
         8.0         16.0         27.0         38.0
         3.0         10.0         29.0         33.0

Matrix m_des2
         2.0         11.0         24.0         31.0
         7.0         14.0         21.0         34.0
         8.0         16.0         27.0         38.0
         3.0         10.0         29.0         33.0

Running benchmark function AvxMat4x4TransposeF32_BM - please wait
Benchmark times save to file Ch06_07_AvxMat4x4TransposeF32_BM_CHROMIUM.csv

Source code example Ch06_07 includes a function named AvxMat4x4TransposeF32_BM that contains
code for measuring execution times of the C++ and assembly language matrix transposition functions. Most
of the timing measurement code is encapsulated in a C++ class named BmThreadTimer. This class includes
two member functions, BmThreadTimer::Start and BmThreadTimer::Stop, that implement a simple
software stopwatch. Class BmThreadTimer also includes a member function named
BmThreadTimer::SaveElapsedTimes, which saves the timing measurements to a comma-separated text file.
AvxMat4x4TransposeF32_BM also uses a C++ class named OS. This class includes member functions that
manage process and thread affinity. In the current example, OS::SetThreadAffinityMask selects a specific
processor for benchmark thread execution. Doing this improves the accuracy of the timing measurements.
The source code for classes BmThreadTimer and OS is not shown in Listing 6-7, but is included as part of
the chapter download package.

Table 6-1 contains matrix transposition timing measurements using several different Intel processors.
The measurements were made using an EXE file that was built with the Visual C++ Release configuration
and the default settings for code optimization except for the following options: AVX code generation
(/arch:AVX) was selected to facilitate "apples-to-apples" comparisons between the C++ and x86-64
assembly language code (the default code generation option for 64-bit Visual C++ is SSE2); whole
program optimization was disabled. All timing measurements were made using ordinary desktop PCs
running Windows 10. No attempt was made to account for any hardware, software, operating system,
or configuration differences between the PCs prior to running the benchmark executable file. The test
conditions described in this section are also used in subsequent chapters.

Table 6-1. Matrix Transposition Mean Execution Times (Microseconds), 1,000,000 Transpositions

CPU                     C++       Assembly Language
Intel Core i7-4790S     15885     2575
Intel Core i9-7900X     13381     2203
Intel Core i7-8700K     12216     1825


The values shown in Table 6-1 were computed using the CSV file execution times and the Excel 
spreadsheet function TRIMMEAN(array,0.10). The assembly language implementation of the matrix 
transposition algorithm clearly outperforms the C++ version by a wide margin. It is not uncommon to 
achieve significant speed improvements using x86 assembly language, especially with algorithms that can 
exploit the SIMD parallelism of an x86 processor. You'll see additional examples of accelerated algorithmic 
performance throughout the remainder of this book. 
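
The trimmed-mean calculation can also be sketched in C++. The function below is a hypothetical helper, not part of the book's code, and it assumes the commonly documented TRIMMEAN behavior: the number of excluded samples is the sample count times the percentage, rounded down to an even number, with half removed from each end of the sorted data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: mean of 'values' after discarding the smallest and
// largest samples (assumed to mirror a TRIMMEAN-style calculation).
double TrimmedMean(std::vector<double> values, double percent)
{
    assert(!values.empty() && percent >= 0.0 && percent < 1.0);
    std::sort(values.begin(), values.end());

    // Samples dropped from EACH end; integer division rounds the total
    // number of excluded samples down to an even count.
    const std::size_t drop = static_cast<std::size_t>(values.size() * percent) / 2;

    const auto first = values.begin() + static_cast<std::ptrdiff_t>(drop);
    const auto last = values.end() - static_cast<std::ptrdiff_t>(drop);
    const double sum = std::accumulate(first, last, 0.0);
    return sum / static_cast<double>(last - first);
}
```

With 500 timing samples and a percentage of 0.10, this sketch discards the 25 smallest and 25 largest values before averaging, which reduces the influence of occasional outliers such as a context switch during one iteration.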

The benchmark timing measurements cited in this book provide reasonable approximations 
of function execution times. Like automobile fuel economy and battery runtime estimates, software 
performance benchmarking is not an exact science and is subject to a variety of pitfalls. It is also important 
to keep in mind that this book is an introductory primer about x86-64 assembly language programming and 
not benchmarking. The source code examples are structured to hasten the study of a new programming 
language, not to maximize performance. In addition, the Visual C++ options described earlier were 
selected mostly for practical reasons and may not yield optimal performance in all cases. Like many 
high-level compilers, Visual C++ includes a plethora of code generation and speed options that can affect 
performance. Benchmark timing measurements should always be construed in a context that's correlated 
with the software's purpose. The methods described in this section are generally worthwhile, but results 
can vary. 


Matrix Multiplication 


The product of two matrices is defined as follows. Let A be an m x n matrix where m and n denote the 
number of rows and columns, respectively. Let B be an n x p matrix. Let C be the product of A and B, which 
is an m x p matrix. The value of each element c(i, j) in C can be calculated using the following equation: 


$$c_{ij} = \sum_{k=0}^{n-1} a_{ik} b_{kj}, \qquad i = 0, \ldots, m-1; \; j = 0, \ldots, p-1$$


Before proceeding to the sample code, a few comments are warranted. According to the definition of 
matrix multiplication, the number of columns in A must equal the number of rows in B. For example, if A is 
a 3 x 4 matrix and B is a 4 x 2 matrix, the product AB (a 3 x 2 matrix) can be calculated but the product BA is 
undefined. Note that the value of each c(i, j) in C is simply the dot product of row i in matrix A and column j 
in matrix B. The assembly language code will exploit this fact to perform matrix multiplications using packed 
AVX instructions. Also note that unlike most mathematical texts, the subscripts in the matrix multiplication 
equation use zero-based indexing. This simplifies translating the equation into C++ and assembly language 
code. 
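
As a minimal sketch of how the zero-based equation maps onto code, the following C++ function multiplies two row-major matrices with three nested loops. The function name MatMul and the use of std::vector are illustrative assumptions; this is not the book's Matrix<T>::Mul implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns c (m x p) = a (m x n) * b (n x p).
// All matrices are stored in row-major order.
std::vector<float> MatMul(const std::vector<float>& a, const std::vector<float>& b,
                          std::size_t m, std::size_t n, std::size_t p)
{
    assert(a.size() == m * n && b.size() == n * p);
    std::vector<float> c(m * p, 0.0f);

    for (std::size_t i = 0; i < m; i++)
    {
        for (std::size_t j = 0; j < p; j++)
        {
            // c(i, j) is the dot product of row i of a and column j of b
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * p + j];
            c[i * p + j] = sum;
        }
    }
    return c;
}
```

For example, multiplying the 2 x 3 matrix {1, 2, 3; 4, 5, 6} by the 3 x 2 matrix {7, 8; 9, 10; 11, 12} yields the 2 x 2 matrix {58, 64; 139, 154}.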

Listing 6-8 shows the source code for example Ch06_08. This example demonstrates how to perform 
matrix multiplication using two 4 x 4 matrices of single-precision floating-point values. Similar to the 
previous example, main calls a function named AvxMat4x4MulF32 that exercises a matrix multiplication 
test case using functions written in C++ and assembly language. The template member function 
Matrix<float>::Mul (source code not shown) carries out C++ matrix multiplication using the previously 
described equation. The assembly language function AvxMat4x4MulF32_ uses SIMD arithmetic to perform 
matrix multiplication as you'll soon see. 
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Listing 6-8. Example Ch06 08 


#pragma once

// Ch06_08_.asm
extern "C" void AvxMat4x4MulF32_(float* m_des, const float* m_src1, const float* m_src2);

// Ch06_08_BM.cpp
extern void AvxMat4x4MulF32_BM(void);


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include "Ch06_08.h"
#include "Matrix.h"

using namespace std;

void AvxMat4x4MulF32(Matrix<float>& m_src1, Matrix<float>& m_src2)
{
    const size_t nr = 4;
    const size_t nc = 4;
    Matrix<float> m_des1(nr ,nc);
    Matrix<float> m_des2(nr ,nc);

    Matrix<float>::Mul(m_des1, m_src1, m_src2);
    AvxMat4x4MulF32_(m_des2.Data(), m_src1.Data(), m_src2.Data());

    cout << fixed << setprecision(1);

    m_src1.SetOstream(12, " ");
    m_src2.SetOstream(12, " ");
    m_des1.SetOstream(12, " ");
    m_des2.SetOstream(12, " ");

    cout << "\nResults for AvxMat4x4MulF32\n";
    cout << "Matrix m_src1\n" << m_src1 << '\n';
    cout << "Matrix m_src2\n" << m_src2 << '\n';
    cout << "Matrix m_des1\n" << m_des1 << '\n';
    cout << "Matrix m_des2\n" << m_des2 << '\n';

    if (m_des1 != m_des2)
        cout << "\nMatrix compare failed - AvxMat4x4MulF32\n";
}

int main()
{
    const size_t nr = 4;
    const size_t nc = 4;
    Matrix<float> m_src1(nr ,nc);
    Matrix<float> m_src2(nr ,nc);

    const float src1_row0[] = { 10, 11, 12, 13 };
    const float src1_row1[] = { 20, 21, 22, 23 };
    const float src1_row2[] = { 30, 31, 32, 33 };
    const float src1_row3[] = { 40, 41, 42, 43 };

    const float src2_row0[] = { 100, 101, 102, 103 };
    const float src2_row1[] = { 200, 201, 202, 203 };
    const float src2_row2[] = { 300, 301, 302, 303 };
    const float src2_row3[] = { 400, 401, 402, 403 };

    m_src1.SetRow(0, src1_row0);
    m_src1.SetRow(1, src1_row1);
    m_src1.SetRow(2, src1_row2);
    m_src1.SetRow(3, src1_row3);

    m_src2.SetRow(0, src2_row0);
    m_src2.SetRow(1, src2_row1);
    m_src2.SetRow(2, src2_row2);
    m_src2.SetRow(3, src2_row3);

    AvxMat4x4MulF32(m_src1, m_src2);
    AvxMat4x4MulF32_BM();
    return 0;
}



        include <MacrosX86-64-AVX.asmh>

; _Mat4x4MulCalcRowF32 macro
;
; Description:  This macro is used to compute one row of a 4x4 matrix
;               multiply.
;
; Registers:    xmm0 = m_src2.row0
;               xmm1 = m_src2.row1
;               xmm2 = m_src2.row2
;               xmm3 = m_src2.row3
;               rcx = m_des ptr
;               rdx = m_src1 ptr
;               xmm4 - xmm7 = scratch registers

_Mat4x4MulCalcRowF32 macro disp
        vbroadcastss xmm4,real4 ptr [rdx+disp]      ;broadcast m_src1[i][0]
        vbroadcastss xmm5,real4 ptr [rdx+disp+4]    ;broadcast m_src1[i][1]
        vbroadcastss xmm6,real4 ptr [rdx+disp+8]    ;broadcast m_src1[i][2]
        vbroadcastss xmm7,real4 ptr [rdx+disp+12]   ;broadcast m_src1[i][3]

        vmulps xmm4,xmm4,xmm0                       ;m_src1[i][0] * m_src2.row_0
        vmulps xmm5,xmm5,xmm1                       ;m_src1[i][1] * m_src2.row_1
        vmulps xmm6,xmm6,xmm2                       ;m_src1[i][2] * m_src2.row_2
        vmulps xmm7,xmm7,xmm3                       ;m_src1[i][3] * m_src2.row_3

        vaddps xmm4,xmm4,xmm5                       ;calc m_des.row_i
        vaddps xmm6,xmm6,xmm7
        vaddps xmm4,xmm4,xmm6

        vmovaps [rcx+disp],xmm4                     ;save m_des.row_i
        endm

; extern "C" void AvxMat4x4MulF32_(float* m_des, const float* m_src1, const float* m_src2)
;
; Description:  The following function computes the product of two
;               single-precision floating-point 4x4 matrices.

        .code
AvxMat4x4MulF32_ proc frame
        _CreateFrame MM_,0,32
        _SaveXmmRegs xmm6,xmm7
        _EndProlog

; Compute matrix product m_des = m_src1 * m_src2
        vmovaps xmm0,[r8]                   ;xmm0 = m_src2.row_0
        vmovaps xmm1,[r8+16]                ;xmm1 = m_src2.row_1
        vmovaps xmm2,[r8+32]                ;xmm2 = m_src2.row_2
        vmovaps xmm3,[r8+48]                ;xmm3 = m_src2.row_3

        _Mat4x4MulCalcRowF32 0              ;calculate m_des.row_0
        _Mat4x4MulCalcRowF32 16             ;calculate m_des.row_1
        _Mat4x4MulCalcRowF32 32             ;calculate m_des.row_2
        _Mat4x4MulCalcRowF32 48             ;calculate m_des.row_3

Done:   _RestoreXmmRegs xmm6,xmm7
        _DeleteFrame
        ret
AvxMat4x4MulF32_ endp
        end
//------------------------------------------------
//               Ch06_08_BM.cpp
//------------------------------------------------

#include "stdafx.h"
#include <iostream>
#include "Ch06_08.h"
#include "Matrix.h"
#include "BmThreadTimer.h"
#include "OS.h"

using namespace std;

void AvxMat4x4MulF32_BM(void)
{
    OS::SetThreadAffinityMask();

    cout << "\nRunning benchmark function AvxMat4x4MulF32_BM - please wait\n";

    const size_t num_rows = 4;
    const size_t num_cols = 4;
    Matrix<float> m_src1(num_rows, num_cols);
    Matrix<float> m_src2(num_rows, num_cols);
    Matrix<float> m_des1(num_rows, num_cols);
    Matrix<float> m_des2(num_rows, num_cols);

    const float m_src1_r0[] = { 10, 11, 12, 13 };
    const float m_src1_r1[] = { 14, 15, 16, 17 };
    const float m_src1_r2[] = { 18, 19, 20, 21 };
    const float m_src1_r3[] = { 22, 23, 24, 25 };
    const float m_src2_r0[] = { 0, 1, 2, 3 };
    const float m_src2_r1[] = { 4, 5, 6, 7 };
    const float m_src2_r2[] = { 8, 9, 10, 11 };
    const float m_src2_r3[] = { 12, 13, 14, 15 };

    m_src1.SetRow(0, m_src1_r0);
    m_src1.SetRow(1, m_src1_r1);
    m_src1.SetRow(2, m_src1_r2);
    m_src1.SetRow(3, m_src1_r3);
    m_src2.SetRow(0, m_src2_r0);
    m_src2.SetRow(1, m_src2_r1);
    m_src2.SetRow(2, m_src2_r2);
    m_src2.SetRow(3, m_src2_r3);

    const size_t num_it = 500;
    const size_t num_alg = 2;
    const size_t num_ops = 1000000;

    BmThreadTimer bmtt(num_it, num_alg);

    for (size_t i = 0; i < num_it; i++)
    {
        bmtt.Start(i, 0);
        for (size_t j = 0; j < num_ops; j++)
            Matrix<float>::Mul(m_des1, m_src1, m_src2);
        bmtt.Stop(i, 0);

        bmtt.Start(i, 1);
        for (size_t j = 0; j < num_ops; j++)
            AvxMat4x4MulF32_(m_des2.Data(), m_src1.Data(), m_src2.Data());
        bmtt.Stop(i, 1);
    }

    string fn = bmtt.BuildCsvFilenameString("Ch06_08_AvxMat4x4MulF32_BM");
    bmtt.SaveElapsedTimes(fn, BmThreadTimer::EtUnit::MicroSec, 2);
    cout << "Benchmark times save to file " << fn << '\n';
}


The standard technique for performing matrix multiplication requires three nested for loops that 
employ scalar floating-point multiplication and addition (see the code for Matrix<T>::Mul in the header 
file Matrix.h). Figure 6-7 shows the explicit equations that can be used to calculate the elements of row 0 for 
the matrix product C = AB. Note that in these equations every element in row k of matrix B is multiplied by 
the same element a(0, k) from matrix A. Similar sets of equations can be used to calculate rows 1, 2, and 3 of 
matrix C. The assembly language code in function AvxMat4x4MulF32_ uses these equations to carry out 
matrix multiplication using SIMD arithmetic. 


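$$C = AB$$

$$\begin{bmatrix} c_{00} & c_{01} & c_{02} & c_{03} \\ c_{10} & c_{11} & c_{12} & c_{13} \\ c_{20} & c_{21} & c_{22} & c_{23} \\ c_{30} & c_{31} & c_{32} & c_{33} \end{bmatrix} = \begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ a_{20} & a_{21} & a_{22} & a_{23} \\ a_{30} & a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} b_{00} & b_{01} & b_{02} & b_{03} \\ b_{10} & b_{11} & b_{12} & b_{13} \\ b_{20} & b_{21} & b_{22} & b_{23} \\ b_{30} & b_{31} & b_{32} & b_{33} \end{bmatrix}$$

$$\begin{aligned}
c_{00} &= a_{00} b_{00} + a_{01} b_{10} + a_{02} b_{20} + a_{03} b_{30} \\
c_{01} &= a_{00} b_{01} + a_{01} b_{11} + a_{02} b_{21} + a_{03} b_{31} \\
c_{02} &= a_{00} b_{02} + a_{01} b_{12} + a_{02} b_{22} + a_{03} b_{32} \\
c_{03} &= a_{00} b_{03} + a_{01} b_{13} + a_{02} b_{23} + a_{03} b_{33}
\end{aligned}$$

(The four columns of products use elements from row 0, row 1, row 2, and row 3 of matrix B, respectively.)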


Figure 6-7. Equations for first row of matrix C = AB 
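
The row-oriented form of these equations can be sketched in portable C++ before looking at the SIMD version. In the sketch below, the type aliases and the function name Mat4x4MulByRows are hypothetical; the innermost loop over lane stands in for what a single packed multiply and add would do across four elements. The accumulation order differs slightly from the assembly language macro, which can matter for floating-point rounding in general.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Row4 = std::array<float, 4>;      // one matrix row (four "lanes")
using Mat4x4 = std::array<Row4, 4>;     // four rows

// Hypothetical helper: c.row[i] = sum over k of a[i][k] * b.row[k]
Mat4x4 Mat4x4MulByRows(const Mat4x4& a, const Mat4x4& b)
{
    Mat4x4 c{};

    for (std::size_t i = 0; i < 4; i++)
    {
        Row4 acc{};                             // accumulator row, all zeros

        for (std::size_t k = 0; k < 4; k++)
        {
            const float s = a[i][k];            // scalar that gets "broadcast"

            // Emulates one packed multiply-add across the four lanes
            for (std::size_t lane = 0; lane < 4; lane++)
                acc[lane] += s * b[k][lane];
        }

        c[i] = acc;                             // save the entire row
    }
    return c;
}
```

Applied to the test matrices used in this example, the first row of the result is 12000, 12046, 12092, 12138, which matches the program output.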


Following its prolog, AvxMat4x4MulF32_ loads matrix m_src2 (or B) into registers XMM0-XMM3. The 
next four lines use the macro _Mat4x4MulCalcRowF32 to calculate the products for rows 0-3 of m_des (or C). 
This macro implements the four equations that are shown in Figure 6-7. The macro parameter disp specifies 
which row to use. Macro _Mat4x4MulCalcRowF32 uses four vbroadcastss instructions to load the required 
elements from matrix m_src1 (or A) into registers XMM4-XMM7. It then uses four vmulps instructions to 
multiply these values by an entire row from matrix m_src2. A series of vaddps instructions computes the final 
element values for the row. The vmovaps [rcx+disp],xmm4 instruction saves the entire row to the specified 
destination buffer. Here is the output for example Ch06_08: 
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Results for AvxMat4x4MulF32 


Matrix m_src1
        10.0         11.0         12.0         13.0
        20.0         21.0         22.0         23.0
        30.0         31.0         32.0         33.0
        40.0         41.0         42.0         43.0

Matrix m_src2
       100.0        101.0        102.0        103.0
       200.0        201.0        202.0        203.0
       300.0        301.0        302.0        303.0
       400.0        401.0        402.0        403.0

Matrix m_des1
     12000.0      12046.0      12092.0      12138.0
     22000.0      22086.0      22172.0      22258.0
     32000.0      32126.0      32252.0      32378.0
     42000.0      42166.0      42332.0      42498.0

Matrix m_des2
     12000.0      12046.0      12092.0      12138.0
     22000.0      22086.0      22172.0      22258.0
     32000.0      32126.0      32252.0      32378.0
     42000.0      42166.0      42332.0      42498.0

Running benchmark function AvxMat4x4MulF32_BM - please wait
Benchmark times save to file Ch06_08_AvxMat4x4MulF32_BM_CHROMIUM.csv


Source code example Ch06_08 also includes a function named AvxMat4x4MulF32_BM that performs 
benchmark timing measurements of the matrix multiplication functions. Table 6-2 shows the timing 
measurements for several different Intel processors. These measurements were made using the procedure 
described in the previous section. 


Table 6-2. Matrix Multiplication Mean Execution Times (Microseconds), 1,000,000 Multiplications 


CPU                      C++       Assembly Language 
Intel Core i7-4790S      55195     5333 
Intel Core i9-7900X      46008     4897 
Intel Core i7-8700K      42260     4493 




Summary 


Here are the key learning points for Chapter 6: 




• The vaddp[d|s], vsubp[d|s], vmulp[d|s], vdivp[d|s], and vsqrtp[d|s] 
  instructions carry out common arithmetic operations using packed double-precision 
  and packed single-precision floating-point operands. 

• The vcvtp[d|s]2dq and vcvtdq2p[d|s] instructions perform conversions between 
  packed floating-point and packed signed-doubleword operands. The vcvtps2pd 
  and vcvtpd2ps instructions perform conversions between packed single-precision 
  and double-precision operands. 

• The vminp[d|s] and vmaxp[d|s] instructions perform packed minimum and 
  maximum value calculations using double-precision and single-precision 
  floating-point operands. 

• The vbroadcasts[d|s] instructions broadcast (or copy) a single scalar 
  double-precision or single-precision value to all element positions of an x86 SIMD register. 

• Assembly language functions that use the vmovap[d|s] and vmovdqa instructions can 
  only be used with operands in memory that are properly aligned. The MASM align 
  16 directive aligns data items in a .const or .data section to a 16-byte boundary. 
  C++ functions can use the alignas specifier to guarantee proper alignment. 

• Assembly language functions can use the vunpck[h|l]p[d|s] instructions to 
  accelerate common matrix operations, especially those involving 4 x 4 matrices. 

• Assembly language functions can use the vhaddp[d|s] and vshufp[d|s] instructions 
  to perform data reductions of intermediate packed values. 

• Many algorithms can achieve significant performance gains by using SIMD 
  programming techniques and the x86-AVX instruction set. 


CHAPTER 7 


AVX Programming — Packed Integers 


In the previous chapter, you learned how to use the AVX instruction set to perform calculations using packed 
floating-point operands. In this chapter, you learn how to carry out computations using packed integer 
operands. Similar to the previous chapter, the first few source code examples in this chapter demonstrate 
basic arithmetic operations using packed integers. The remaining source code examples illustrate how 
to use the computational resources of AVX to perform common image processing operations, including 
histogram creation and thresholding. 

AVX supports packed integer operations using 128-bit wide operands, and that is the focus of the source 
code examples in this chapter. Performing packed integer operations using 256-bit operands requires a 
processor that supports AVX2. You learn about AVX2 programming with packed integers in Chapter 10. 


Packed Integer Addition and Subtraction 


Listing 7-1 shows the C++ and assembly language source code for example Ch07_01. This example 
demonstrates how to perform packed integer addition and subtraction using signed and unsigned 16-bit 
integers. It also illustrates both wraparound and saturated arithmetic. 


Listing 7-1. Example Ch07_01 


#include "stdafx.h"
#include <iostream>
#include <string>
#include "XmmVal.h"

using namespace std;

extern "C" void AvxPackedAddI16_(const XmmVal& a, const XmmVal& b, XmmVal c[2]);
extern "C" void AvxPackedSubI16_(const XmmVal& a, const XmmVal& b, XmmVal c[2]);
extern "C" void AvxPackedAddU16_(const XmmVal& a, const XmmVal& b, XmmVal c[2]);
extern "C" void AvxPackedSubU16_(const XmmVal& a, const XmmVal& b, XmmVal c[2]);

//
// Signed packed addition and subtraction
//

void AvxPackedAddI16(void)
{
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;
    alignas(16) XmmVal c[2];

    a.m_I16[0] = 10;          b.m_I16[0] = 100;
    a.m_I16[1] = 200;         b.m_I16[1] = -200;
    a.m_I16[2] = 30;          b.m_I16[2] = 32760;
    a.m_I16[3] = -32766;      b.m_I16[3] = -400;
    a.m_I16[4] = 50;          b.m_I16[4] = 500;
    a.m_I16[5] = 60;          b.m_I16[5] = -600;
    a.m_I16[6] = 32000;       b.m_I16[6] = 1200;
    a.m_I16[7] = -32000;      b.m_I16[7] = -950;

    AvxPackedAddI16_(a, b, c);

    cout << "\nResults for AxvPackedAddI16 - Wraparound Addition\n";
    cout << "a:    " << a.ToStringI16() << '\n';
    cout << "b:    " << b.ToStringI16() << '\n';
    cout << "c[0]: " << c[0].ToStringI16() << '\n';
    cout << "\nResults for AxvPackedAddI16 - Saturated Addition\n";
    cout << "a:    " << a.ToStringI16() << '\n';
    cout << "b:    " << b.ToStringI16() << '\n';
    cout << "c[1]: " << c[1].ToStringI16() << '\n';
}

void AvxPackedSubI16(void)
{
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;
    alignas(16) XmmVal c[2];

    a.m_I16[0] = 10;          b.m_I16[0] = 100;
    a.m_I16[1] = 200;         b.m_I16[1] = -200;
    a.m_I16[2] = -30;         b.m_I16[2] = 32760;
    a.m_I16[3] = -32766;      b.m_I16[3] = 400;
    a.m_I16[4] = 50;          b.m_I16[4] = 500;
    a.m_I16[5] = 60;          b.m_I16[5] = -600;
    a.m_I16[6] = 32000;       b.m_I16[6] = 1200;
    a.m_I16[7] = -32000;      b.m_I16[7] = 950;

    AvxPackedSubI16_(a, b, c);

    cout << "\nResults for AxvPackedSubI16 - Wraparound Subtraction\n";
    cout << "a:    " << a.ToStringI16() << '\n';
    cout << "b:    " << b.ToStringI16() << '\n';
    cout << "c[0]: " << c[0].ToStringI16() << '\n';
    cout << "\nResults for AxvPackedSubI16 - Saturated Subtraction\n";
    cout << "a:    " << a.ToStringI16() << '\n';
    cout << "b:    " << b.ToStringI16() << '\n';
    cout << "c[1]: " << c[1].ToStringI16() << '\n';
}

//
// Unsigned packed addition and subtraction
//

void AvxPackedAddU16(void)
{
    XmmVal a;
    XmmVal b;
    XmmVal c[2];

    a.m_U16[0] = 10;          b.m_U16[0] = 100;
    a.m_U16[1] = 200;         b.m_U16[1] = 200;
    a.m_U16[2] = 300;         b.m_U16[2] = 65530;
    a.m_U16[3] = 32766;       b.m_U16[3] = 40000;
    a.m_U16[4] = 50;          b.m_U16[4] = 500;
    a.m_U16[5] = 20000;       b.m_U16[5] = 25000;
    a.m_U16[6] = 32000;       b.m_U16[6] = 1200;
    a.m_U16[7] = 32000;       b.m_U16[7] = 50000;

    AvxPackedAddU16_(a, b, c);

    cout << "\nResults for AxvPackedAddU16 - Wraparound Addition\n";
    cout << "a:    " << a.ToStringU16() << '\n';
    cout << "b:    " << b.ToStringU16() << '\n';
    cout << "c[0]: " << c[0].ToStringU16() << '\n';
    cout << "\nResults for AxvPackedAddU16 - Saturated Addition\n";
    cout << "a:    " << a.ToStringU16() << '\n';
    cout << "b:    " << b.ToStringU16() << '\n';
    cout << "c[1]: " << c[1].ToStringU16() << '\n';
}

void AvxPackedSubU16(void)
{
    XmmVal a;
    XmmVal b;
    XmmVal c[2];

    a.m_U16[0] = 10;          b.m_U16[0] = 100;
    a.m_U16[1] = 200;         b.m_U16[1] = 200;
    a.m_U16[2] = 30;          b.m_U16[2] = 7;
    a.m_U16[3] = 65000;       b.m_U16[3] = 5000;
    a.m_U16[4] = 60;          b.m_U16[4] = 500;
    a.m_U16[5] = 25000;       b.m_U16[5] = 28000;
    a.m_U16[6] = 32000;       b.m_U16[6] = 1200;
    a.m_U16[7] = 1200;        b.m_U16[7] = 950;

    AvxPackedSubU16_(a, b, c);

    cout << "\nResults for AxvPackedSubU16 - Wraparound Subtraction\n";
    cout << "a:    " << a.ToStringU16() << '\n';
    cout << "b:    " << b.ToStringU16() << '\n';
    cout << "c[0]: " << c[0].ToStringU16() << '\n';
    cout << "\nResults for AxvPackedSubI16 - Saturated Subtraction\n";
    cout << "a:    " << a.ToStringU16() << '\n';
    cout << "b:    " << b.ToStringU16() << '\n';
    cout << "c[1]: " << c[1].ToStringU16() << '\n';
}

int main()
{
    string sep(75, '-');

    AvxPackedAddI16();
    AvxPackedSubI16();
    cout << '\n' << sep << '\n';
    AvxPackedAddU16();
    AvxPackedSubU16();
    return 0;
}


; extern "C" void AvxPackedAddI16_(const XmmVal& a, const XmmVal& b, XmmVal c[2])

        .code
AvxPackedAddI16_ proc

; Packed signed word addition
        vmovdqa xmm0,xmmword ptr [rcx]      ;xmm0 = a
        vmovdqa xmm1,xmmword ptr [rdx]      ;xmm1 = b

        vpaddw xmm2,xmm0,xmm1               ;packed add - wraparound
        vpaddsw xmm3,xmm0,xmm1              ;packed add - saturated

        vmovdqa xmmword ptr [r8],xmm2       ;save c[0]
        vmovdqa xmmword ptr [r8+16],xmm3    ;save c[1]
        ret
AvxPackedAddI16_ endp

; extern "C" void AvxPackedSubI16_(const XmmVal& a, const XmmVal& b, XmmVal c[2])

AvxPackedSubI16_ proc

; Packed signed word subtraction
        vmovdqa xmm0,xmmword ptr [rcx]      ;xmm0 = a
        vmovdqa xmm1,xmmword ptr [rdx]      ;xmm1 = b

        vpsubw xmm2,xmm0,xmm1               ;packed sub - wraparound
        vpsubsw xmm3,xmm0,xmm1              ;packed sub - saturated

        vmovdqa xmmword ptr [r8],xmm2       ;save c[0]
        vmovdqa xmmword ptr [r8+16],xmm3    ;save c[1]
        ret
AvxPackedSubI16_ endp

; extern "C" void AvxPackedAddU16_(const XmmVal& a, const XmmVal& b, XmmVal c[2])

AvxPackedAddU16_ proc

; Packed unsigned word addition
        vmovdqu xmm0,xmmword ptr [rcx]      ;xmm0 = a
        vmovdqu xmm1,xmmword ptr [rdx]      ;xmm1 = b

        vpaddw xmm2,xmm0,xmm1               ;packed add - wraparound
        vpaddusw xmm3,xmm0,xmm1             ;packed add - saturated

        vmovdqu xmmword ptr [r8],xmm2       ;save c[0]
        vmovdqu xmmword ptr [r8+16],xmm3    ;save c[1]
        ret
AvxPackedAddU16_ endp

; extern "C" void AvxPackedSubU16_(const XmmVal& a, const XmmVal& b, XmmVal c[2])

AvxPackedSubU16_ proc

; Packed unsigned word subtraction
        vmovdqu xmm0,xmmword ptr [rcx]      ;xmm0 = a
        vmovdqu xmm1,xmmword ptr [rdx]      ;xmm1 = b

        vpsubw xmm2,xmm0,xmm1               ;packed sub - wraparound
        vpsubusw xmm3,xmm0,xmm1             ;packed sub - saturated

        vmovdqu xmmword ptr [r8],xmm2       ;save c[0]
        vmovdqu xmmword ptr [r8+16],xmm3    ;save c[1]
        ret
AvxPackedSubU16_ endp
        end


Toward the top of the C++ code are the declarations for the assembly language functions that perform 
packed integer addition and subtraction. Each function takes two XmmVal arguments and saves its results 
to an XmmVal array. The structure XmmVal, which you learned about in Chapter 6 (see Listing 6-1), contains 
a publicly-accessible anonymous union with members that correspond to the packed data types that can 
be used with an XMM register. The XmmVal structure also defines several member functions that format the 
contents of an XmmVal for display. 

The C++ function AvxPackedAddI16 contains test code that exercises the assembly language function 
AvxPackedAddI16_. This function performs packed signed 16-bit integer (word) addition using both 
wraparound and saturated arithmetic. Note that the XmmVal variables a, b, and c are all defined using the 
C++ specifier alignas(16), which aligns each XmmVal to a 16-byte boundary. Following the execution of 
the function AvxPackedAddI16_, the results are displayed using a series of stream writes to cout. The C++ 
function AvxPackedSubI16, which is similar to AvxPackedAddI16, uses the assembly language function 
AvxPackedSubI16_. 

A parallel set of C++ functions, AvxPackedAddU16 and AvxPackedSubU16, contains code that exercises 
the assembly language functions AvxPackedAddU16_ and AvxPackedSubU16_. These functions perform 
packed unsigned 16-bit integer addition and subtraction, respectively. Note that the XmmVal variables in 
AvxPackedAddU16 and AvxPackedSubU16 do not use the alignas(16) specifier, which means that these 
values are not guaranteed to be aligned on a 16-byte boundary. The reason for doing this is to demonstrate 
the use of the AVX instruction vmovdqu (Move Unaligned Packed Integer Values), as you'll soon see. 

The assembly language function AvxPackedAddI16_ starts with a vmovdqa xmm0,xmmword ptr [rcx] 
instruction that loads argument value a into register XMM0. The ensuing vmovdqa xmm1,xmmword ptr [rdx] 
instruction copies b into register XMM1. The next two instructions, vpaddw xmm2,xmm0,xmm1 and vpaddsw 
xmm3,xmm0,xmm1, carry out packed signed 16-bit integer addition using wraparound and saturated arithmetic, 
respectively. The final two vmovdqa instructions save the calculated results to XmmVal array c. Assembly 
language function AvxPackedSubI16_ is similar to AvxPackedAddI16_ and uses the instructions vpsubw and 
vpsubsw to carry out packed signed 16-bit integer subtraction. 
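
The difference between wraparound and saturated arithmetic can be illustrated with scalar C++ code that processes a single 16-bit element. This is only a sketch of the per-element semantics; the function names are hypothetical, and the packed instructions apply the same rule to all eight word elements at once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Wraparound addition: only the low-order 16 bits of the sum are kept.
std::int16_t AddWrapI16(std::int16_t a, std::int16_t b)
{
    int sum = static_cast<int>(a) + static_cast<int>(b);

    if (sum > 32767)
        sum -= 65536;       // wrapped past the most positive value
    else if (sum < -32768)
        sum += 65536;       // wrapped past the most negative value

    return static_cast<std::int16_t>(sum);
}

// Saturated addition: the sum is clipped to the range [-32768, 32767].
std::int16_t AddSatI16(std::int16_t a, std::int16_t b)
{
    const int sum = static_cast<int>(a) + static_cast<int>(b);
    return static_cast<std::int16_t>(std::min(32767, std::max(-32768, sum)));
}
```

Using the third element of the test case (30 and 32760), the wraparound sum is -32746 while the saturated sum is 32767. Saturated arithmetic is often preferred in image and audio processing, where clipping is less objectionable than a sign flip.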

The assembly language function AvxPackedAddU16_ begins with a vmovdqu xmm0,xmmword ptr [rcx] 
instruction that loads a into register XMM0. A vmovdqu instruction is used here since XmmVal a was defined 
in the C++ code without the alignas(16) specifier. Note that function AvxPackedAddU16_ uses vmovdqu 
for demonstration purposes only; a properly aligned XmmVal and a vmovdqa instruction should have been 
used instead. It's already been mentioned a number of times in this book but warrants repeating due to its 
importance: SIMD operands in memory should be properly aligned whenever possible in order to avoid 
potential performance penalties that can occur whenever the processor accesses an unaligned operand in 
memory. 
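
A C++ caller can verify alignment before handing a buffer to code that uses aligned loads and stores. The following is a minimal sketch; the structure Vec128 and the function IsAligned are hypothetical names introduced only for illustration and are not part of the book's XmmVal code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 128-bit value type that is always 16-byte aligned.
struct alignas(16) Vec128
{
    std::uint8_t bytes[16];
};

// Returns true if address p is a multiple of 'alignment' (in bytes).
bool IsAligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A debug build might, for example, call assert(IsAligned(&a, 16)) before invoking an assembly language function that uses vmovdqa or vmovaps, since those instructions fault when a memory operand is not properly aligned.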

Following the loading of argument values a and b into registers XMM0 and XMM1, function 
AvxPackedAddU16_ performs packed unsigned 16-bit integer addition using the instructions vpaddw 
xmm2,xmm0,xmm1 (wraparound arithmetic) and vpaddusw xmm3,xmm0,xmm1 (saturated arithmetic). Two 
vmovdqu instructions save the results to array c. The function AvxPackedSubU16_ implements packed 
unsigned 16-bit integer subtraction using the vpsubw and vpsubusw instructions. This function also uses the 
vmovdqu instruction to load argument values and save results. Here are the results for source code example 
Ch07_01: 


Results for AxvPackedAddI16 - Wraparound Addition
a:          10     200      30  -32766   |      50      60   32000  -32000
b:         100    -200   32760    -400   |     500    -600    1200    -950
c[0]:      110       0  -32746   32370   |     550    -540  -32336   32586

Results for AxvPackedAddI16 - Saturated Addition
a:          10     200      30  -32766   |      50      60   32000  -32000
b:         100    -200   32760    -400   |     500    -600    1200    -950
c[1]:      110       0   32767  -32768   |     550    -540   32767  -32768

Results for AxvPackedSubI16 - Wraparound Subtraction
a:          10     200     -30  -32766   |      50      60   32000  -32000
b:         100    -200   32760     400   |     500    -600    1200     950
c[0]:      -90     400   32746   32370   |    -450     660   30800   32586

Results for AxvPackedSubI16 - Saturated Subtraction
a:          10     200     -30  -32766   |      50      60   32000  -32000
b:         100    -200   32760     400   |     500    -600    1200     950
c[1]:      -90     400  -32768  -32768   |    -450     660   30800  -32768

Results for AxvPackedAddU16 - Wraparound Addition
a:          10     200     300   32766   |      50   20000   32000   32000
b:         100     200   65530   40000   |     500   25000    1200   50000
c[0]:      110     400     294    7230   |     550   45000   33200   16464

Results for AxvPackedAddU16 - Saturated Addition
a:          10     200     300   32766   |      50   20000   32000   32000
b:         100     200   65530   40000   |     500   25000    1200   50000
c[1]:      110     400   65535   65535   |     550   45000   33200   65535

Results for AxvPackedSubU16 - Wraparound Subtraction
a:          10     200      30   65000   |      60   25000   32000    1200
b:         100     200       7    5000   |     500   28000    1200     950
c[0]:    65446       0      23   60000   |   65096   62536   30800     250

Results for AxvPackedSubI16 - Saturated Subtraction
a:          10     200      30   65000   |      60   25000   32000    1200
b:         100     200       7    5000   |     500   28000    1200     950
c[1]:        0       0      23   60000   |       0       0   30800     250

AVX also supports packed integer addition and subtraction using 8-, 32-, and 64-bit integers. The 
vpaddb, vpaddsb, vpaddusb, vpsubb, vpsubsb, and vpsubusb instructions are the 8-bit (byte) versions of 
the packed 16-bit instructions that were demonstrated in this example. The vpadd[d|q] and vpsub[d|q] 
instructions can be employed to perform packed 32-bit (doubleword) or 64-bit (quadword) addition and 
subtraction using wraparound arithmetic. AVX does not support saturated addition and subtraction using 
packed doubleword or quadword integers. 


Packed Integer Shifts 


The next source code example is named Ch07_02. This example illustrates how to perform logical and 
arithmetic shift operations using packed integer operands. Listing 7-2 shows the C++ and assembly language 
source code for example Ch07_02. 


Listing 7-2. Example Ch07 02 


#include "stdafx.h"
#include <iostream>
#include "XmmVal.h"

using namespace std;

// The order of the name constants in the following enum must
// correspond to the table values defined in .asm file.

enum ShiftOp : unsigned int
{
    U16_LL,     // shift left logical - word
    U16_RL,     // shift right logical - word
    U16_RA,     // shift right arithmetic - word
    U32_LL,     // shift left logical - doubleword
    U32_RL,     // shift right logical - doubleword
    U32_RA,     // shift right arithmetic - doubleword
};

extern "C" bool AvxPackedIntegerShift_(XmmVal& b, const XmmVal& a, ShiftOp shift_op,
    unsigned int count);

void AvxPackedIntegerShiftU16(void)
{
    unsigned int count = 2;
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;

    a.m_U16[0] = 0x1234;
    a.m_U16[1] = 0xFF00;
    a.m_U16[2] = 0x00CC;
    a.m_U16[3] = 0x8080;
    a.m_U16[4] = 0x00FF;
    a.m_U16[5] = 0xAAAA;
    a.m_U16[6] = 0x0F0F;
    a.m_U16[7] = 0x0101;

    AvxPackedIntegerShift_(b, a, U16_LL, count);
    cout << "\nResults for ShiftOp::U16_LL (count = " << count << ")\n";
    cout << "a: " << a.ToStringX16() << '\n';
    cout << "b: " << b.ToStringX16() << '\n';

    AvxPackedIntegerShift_(b, a, U16_RL, count);
    cout << "\nResults for ShiftOp::U16_RL (count = " << count << ")\n";
    cout << "a: " << a.ToStringX16() << '\n';
    cout << "b: " << b.ToStringX16() << '\n';

    AvxPackedIntegerShift_(b, a, U16_RA, count);
    cout << "\nResults for ShiftOp::U16_RA (count = " << count << ")\n";
    cout << "a: " << a.ToStringX16() << '\n';
    cout << "b: " << b.ToStringX16() << '\n';
}

void AvxPackedIntegerShiftU32(void)
{
    unsigned int count = 4;
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;

    a.m_U32[0] = 0x12345678;
    a.m_U32[1] = 0xFF00FF00;
    a.m_U32[2] = 0x03030303;
    a.m_U32[3] = 0x80800F0F;

    AvxPackedIntegerShift_(b, a, U32_LL, count);
    cout << "\nResults for ShiftOp::U32_LL (count = " << count << ")\n";
    cout << "a: " << a.ToStringX32() << '\n';
    cout << "b: " << b.ToStringX32() << '\n';

    AvxPackedIntegerShift_(b, a, U32_RL, count);
    cout << "\nResults for ShiftOp::U32_RL (count = " << count << ")\n";
    cout << "a: " << a.ToStringX32() << '\n';
    cout << "b: " << b.ToStringX32() << '\n';

    AvxPackedIntegerShift_(b, a, U32_RA, count);
    cout << "\nResults for ShiftOp::U32_RA (count = " << count << ")\n";
    cout << "a: " << a.ToStringX32() << '\n';
    cout << "b: " << b.ToStringX32() << '\n';
}

int main(void)
{
    string sep(75, '-');

    AvxPackedIntegerShiftU16();
    cout << '\n' << sep << '\n';
    AvxPackedIntegerShiftU32();
    return 0;
}

;-------------------------------------------------
;               Ch07_02.asm
;-------------------------------------------------

; extern "C" bool AvxPackedIntegerShift_(XmmVal& b, const XmmVal& a, ShiftOp shift_op,
;   unsigned int count)
;
; Returns:      0 = invalid shift_op argument, 1 = success
;
; Note:         This module requires linker option /LARGEADDRESSAWARE:NO
;               to be explicitly set.

        .code
AvxPackedIntegerShift_ proc

; Make sure 'shift_op' is valid
        mov r8d,r8d                         ;zero extend shift_op
        cmp r8,ShiftOpTableCount            ;compare against table count
        jae Error                           ;jump if shift_op is invalid

; Jump to the operation specified by shift_op
        vmovdqa xmm0,xmmword ptr [rdx]      ;xmm0 = a
        vmovd xmm1,r9d                      ;xmm1[31:0] = shift count
        mov eax,1                           ;set success return code
        jmp [ShiftOpTable+r8*8]

; Packed shift left logical - word
U16_LL: vpsllw xmm2,xmm0,xmm1
        vmovdqa xmmword ptr [rcx],xmm2
        ret

; Packed shift right logical - word
U16_RL: vpsrlw xmm2,xmm0,xmm1
        vmovdqa xmmword ptr [rcx],xmm2
        ret

; Packed shift right arithmetic - word
U16_RA: vpsraw xmm2,xmm0,xmm1
        vmovdqa xmmword ptr [rcx],xmm2
        ret

; Packed shift left logical - doubleword
U32_LL: vpslld xmm2,xmm0,xmm1
        vmovdqa xmmword ptr [rcx],xmm2
        ret

; Packed shift right logical - doubleword
U32_RL: vpsrld xmm2,xmm0,xmm1
        vmovdqa xmmword ptr [rcx],xmm2
        ret

; Packed shift right arithmetic - doubleword
U32_RA: vpsrad xmm2,xmm0,xmm1
        vmovdqa xmmword ptr [rcx],xmm2
        ret

Error:  xor eax,eax                         ;set error code
        vpxor xmm0,xmm0,xmm0
        vmovdqa xmmword ptr [rcx],xmm0      ;set result to zero
        ret

; The order of the labels in the following table must correspond
; to the enums that are defined in .cpp file.

        align 8
ShiftOpTable    qword U16_LL, U16_RL, U16_RA
                qword U32_LL, U32_RL, U32_RA
ShiftOpTableCount equ ($ - ShiftOpTable) / size qword

AvxPackedIntegerShift_ endp
        end


The C++ code that’s shown in Listing 7-2 begins with the definition of an enum named ShiftOp, which is 
used to select a shift operation. Supported shift operations include logical left, logical right, and arithmetic 
right using packed word and doubleword values. Following enum ShiftOp is the declaration for the function 
AvxPackedIntegerShift_. This function carries out the requested shift operation using the supplied 
XmmVal argument and the specified count value. The C++ functions AvxPackedIntegerShiftU16 and 
AvxPackedIntegerShiftU32 initialize test cases for performing various shift operations using packed words 
and doublewords, respectively. 

Assembly language function AvxPackedIntegerShift_ uses a jump table to execute the specified
shift operation. This is similar to what you saw in source code examples Ch05_06 (Chapter 5) and Ch06_03
(Chapter 6). Upon entry to AvxPackedIntegerShift_, the argument value shift_op is tested for validity.
Following validation of shift_op, a vmovdqa xmm0,xmmword ptr [rdx] instruction loads a into register
XMM0. The subsequent vmovd xmm1,r9d instruction copies argument value count into the low-order
doubleword of register XMM1. This is followed by a jmp [ShiftOpTable+r8*8] instruction that transfers
program control to the appropriate code block.
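
The validate-then-dispatch pattern described above can be sketched in C++ using an array of function pointers in place of ShiftOpTable. This is only an illustration under stated assumptions: the helper functions and the name DispatchShift are hypothetical, only the three word-sized operations are modeled, and each call processes a single 16-bit element instead of a whole XMM register.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-ins for the U16_LL, U16_RL, and U16_RA code blocks.
static uint16_t ShiftLL(uint16_t x, unsigned count)
{
    return (count < 16) ? (uint16_t)(x << count) : 0;
}

static uint16_t ShiftRL(uint16_t x, unsigned count)
{
    return (count < 16) ? (uint16_t)(x >> count) : 0;
}

static uint16_t ShiftRA(uint16_t x, unsigned count)
{
    if (count > 15)
        count = 15;                     // large counts fill the element with the sign bit

    uint16_t r = (uint16_t)(x >> count);

    if (x & 0x8000)                     // replicate the sign bit into the vacated bits
        r |= (uint16_t)(0xFFFFu << (16 - count));
    return r;
}

using ShiftFn = uint16_t (*)(uint16_t, unsigned);

// The order of the entries plays the same role as the label order in ShiftOpTable.
static const ShiftFn c_ShiftOpTable[] = { ShiftLL, ShiftRL, ShiftRA };
static const size_t c_ShiftOpTableCount = sizeof(c_ShiftOpTable) / sizeof(c_ShiftOpTable[0]);

// Returns false (and a zero result) when shift_op is outside the table.
bool DispatchShift(uint16_t* b, uint16_t a, unsigned shift_op, unsigned count)
{
    assert(b != nullptr);

    if (shift_op >= c_ShiftOpTableCount)
    {
        *b = 0;
        return false;
    }

    *b = c_ShiftOpTable[shift_op](a, count);
    return true;
}
```

As in the assembly language version, the bounds check has to precede the indexed jump; otherwise an invalid shift_op would transfer control through an arbitrary memory value.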

Each distinct code block in AvxPackedIntegerShift_ performs a particular shift operation. For
example, the code block adjacent to the label U16_LL uses the AVX instruction vpsllw xmm2,xmm0,xmm1 to
perform a logical left shift using packed words. It is important to note that every word element in XMM0 is
independently shifted left by the number of bits specified in XMM1[31:0]. The code blocks adjacent to the
labels U16_RL and U16_RA carry out logical and arithmetic right shifts of packed words using the instructions
vpsrlw and vpsraw, respectively. The function AvxPackedIntegerShift_ employs a similar structure to
perform packed shift operations on doublewords using the instructions vpslld, vpsrld, and vpsrad. All of
the code blocks in AvxPackedIntegerShift_ conclude with a vmovdqa xmmword ptr [rcx],xmm2 instruction
that saves the calculated result. Here is the output for source code example Ch07_02:


Results for ShiftOp::U16_LL (count = 2)
a: 1234 FF00 00CC 8080 00FF AAAA 0F0F 0101
b: 48D0 FC00 0330 0200 03FC AAA8 3C3C 0404

Results for ShiftOp::U16_RL (count = 2)
a: 1234 FF00 00CC 8080 00FF AAAA 0F0F 0101
b: 048D 3FC0 0033 2020 003F 2AAA 03C3 0040

Results for ShiftOp::U16_RA (count = 2)
a: 1234 FF00 00CC 8080 00FF AAAA 0F0F 0101
b: 048D FFC0 0033 E020 003F EAAA 03C3 0040

Results for ShiftOp::U32_LL (count = 4)
a: 12345678 FF00FF00 | 03030303 80800F0F
b: 23456780 F00FF000 | 30303030 0800F0F0

Results for ShiftOp::U32_RL (count = 4)
a: 12345678 FF00FF00 | 03030303 80800F0F
b: 01234567 0FF00FF0 | 00303030 080800F0

Results for ShiftOp::U32_RA (count = 4)
a: 12345678 FF00FF00 | 03030303 80800F0F
b: 01234567 FFF00FF0 | 00303030 F80800F0
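
The difference between the logical and arithmetic right shifts in the doubleword results above can be reproduced with a short scalar sketch. The function names PackedSrlU32 and PackedSraU32 are hypothetical; they model the per-element behavior of vpsrld and vpsrad on an array of 32-bit values and are not part of the book's source code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Logical right shift of each doubleword: vacated bits are filled with zeros.
void PackedSrlU32(const uint32_t* a, uint32_t* b, size_t n, unsigned count)
{
    assert(a != nullptr && b != nullptr);

    for (size_t i = 0; i < n; i++)
        b[i] = (count < 32) ? (a[i] >> count) : 0;
}

// Arithmetic right shift of each doubleword: vacated bits copy the sign bit.
void PackedSraU32(const uint32_t* a, uint32_t* b, size_t n, unsigned count)
{
    assert(a != nullptr && b != nullptr);

    if (count > 31)
        count = 31;                     // large counts fill the element with the sign bit

    for (size_t i = 0; i < n; i++)
    {
        uint32_t r = a[i] >> count;

        if ((a[i] & 0x80000000u) && count != 0)
            r |= 0xFFFFFFFFu << (32 - count);
        b[i] = r;
    }
}
```

The two functions produce identical results for elements whose most significant bit is zero (for example 0x12345678) and differ only for elements such as 0xFF00FF00, where the arithmetic shift preserves the sign.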


The AVX instructions vpsllq and vpsrlq can be used to perform logical shift operations using packed
quadwords; an arithmetic right shift of packed quadwords (vpsraq) is available only on processors that
support AVX-512. Somewhat surprisingly, AVX does not support shift operations using packed byte operands. AVX
also includes the shift instructions vps[l|r]ldq, which carry out logical left or logical right byte shifts of a
128-bit wide operand in an XMM register. You'll see how these instructions work in the next section.


Packed Integer Multiplication 


Besides packed integer addition and subtraction, AVX also includes instructions that perform packed 
integer multiplication. These instructions are slightly different than the corresponding packed addition and 
subtraction instructions. Part of this difference is due to the fact that in order to calculate a non-truncated 
product, integer multiplication requires a destination operand to be twice the size of the original source 
operands. For example, the non-truncated product of two signed 16-bit integers is a signed 32-bit integer. 
Listing 7-3 shows the source code for example Ch07_03. This example demonstrates how to perform packed
integer multiplication using signed 16-bit and 32-bit integers.


Listing 7-3. Example Ch07_03


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <string>
#include "XmmVal.h"

using namespace std;

extern "C" void AvxPackedMulI16_(XmmVal c[2], const XmmVal& a, const XmmVal& b);
extern "C" void AvxPackedMulI32A_(XmmVal c[2], const XmmVal& a, const XmmVal& b);
extern "C" void AvxPackedMulI32B_(XmmVal* c, const XmmVal& a, const XmmVal& b);

void AvxPackedMulI16(void)
{
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;
    alignas(16) XmmVal c[2];

    a.m_I16[0] = 10;        b.m_I16[0] = -5;
    a.m_I16[1] = 3000;      b.m_I16[1] = 100;
    a.m_I16[2] = -2000;     b.m_I16[2] = -9000;
    a.m_I16[3] = 42;        b.m_I16[3] = 1000;
    a.m_I16[4] = -5000;     b.m_I16[4] = 25000;
    a.m_I16[5] = 8;         b.m_I16[5] = 16384;
    a.m_I16[6] = 10000;     b.m_I16[6] = 3500;
    a.m_I16[7] = -60;       b.m_I16[7] = 6000;

    AvxPackedMulI16_(c, a, b);

    cout << "\nResults for AvxPackedMulI16\n";

    for (size_t i = 0; i < 8; i++)
    {
        cout << "a[" << i << "]: " << setw(8) << a.m_I16[i] << " ";
        cout << "b[" << i << "]: " << setw(8) << b.m_I16[i] << " ";

        if (i < 4)
        {
            cout << "c[0][" << i << "]: ";
            cout << setw(12) << c[0].m_I32[i] << '\n';
        }
        else
        {
            cout << "c[1][" << i - 4 << "]: ";
            cout << setw(12) << c[1].m_I32[i - 4] << '\n';
        }
    }
}

void AvxPackedMulI32A(void)
{
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;
    alignas(16) XmmVal c[2];

    a.m_I32[0] = 10;        b.m_I32[0] = -500;
    a.m_I32[1] = 3000;      b.m_I32[1] = 100;
    a.m_I32[2] = -40000;    b.m_I32[2] = -120000;
    a.m_I32[3] = 4200;      b.m_I32[3] = 1000;

    AvxPackedMulI32A_(c, a, b);

    cout << "\nResults for AvxPackedMulI32A\n";

    for (size_t i = 0; i < 4; i++)
    {
        cout << "a[" << i << "]: " << setw(10) << a.m_I32[i] << " ";
        cout << "b[" << i << "]: " << setw(10) << b.m_I32[i] << " ";

        if (i < 2)
        {
            cout << "c[0][" << i << "]: ";
            cout << setw(14) << c[0].m_I64[i] << '\n';
        }
        else
        {
            cout << "c[1][" << i - 2 << "]: ";
            cout << setw(14) << c[1].m_I64[i - 2] << '\n';
        }
    }
}

void AvxPackedMulI32B(void)
{
    alignas(16) XmmVal a;
    alignas(16) XmmVal b;
    alignas(16) XmmVal c;

    a.m_I32[0] = 10;        b.m_I32[0] = -500;
    a.m_I32[1] = 3000;      b.m_I32[1] = 100;
    a.m_I32[2] = -2000;     b.m_I32[2] = -12000;
    a.m_I32[3] = 4200;      b.m_I32[3] = 1000;

    AvxPackedMulI32B_(&c, a, b);

    cout << "\nResults for AvxPackedMulI32B\n";

    for (size_t i = 0; i < 4; i++)
    {
        cout << "a[" << i << "]: " << setw(10) << a.m_I32[i] << " ";
        cout << "b[" << i << "]: " << setw(10) << b.m_I32[i] << " ";
        cout << "c[" << i << "]: " << setw(10) << c.m_I32[i] << '\n';
    }
}

int main()
{
    string sep(75, '-');

    AvxPackedMulI16();
    cout << '\n' << sep << '\n';
    AvxPackedMulI32A();
    cout << '\n' << sep << '\n';
    AvxPackedMulI32B();
    return 0;
}
;-------------------------------------------------
;               Ch07_03.asm
;-------------------------------------------------

; extern "C" void AvxPackedMulI16_(XmmVal c[2], const XmmVal* a, const XmmVal* b)

        .code
AvxPackedMulI16_ proc
        vmovdqa xmm0,xmmword ptr [rdx]      ;xmm0 = a
        vmovdqa xmm1,xmmword ptr [r8]       ;xmm1 = b

        vpmullw xmm2,xmm0,xmm1              ;xmm2 = packed a * b low result
        vpmulhw xmm3,xmm0,xmm1              ;xmm3 = packed a * b high result

        vpunpcklwd xmm4,xmm2,xmm3           ;merge low and high results
        vpunpckhwd xmm5,xmm2,xmm3           ;into final signed dwords

        vmovdqa xmmword ptr [rcx],xmm4      ;save final results
        vmovdqa xmmword ptr [rcx+16],xmm5
        ret
AvxPackedMulI16_ endp

; extern "C" void AvxPackedMulI32A_(XmmVal c[2], const XmmVal* a, const XmmVal* b)

AvxPackedMulI32A_ proc

; Perform packed signed dword multiplication. Note that vpmuldq
; performs following operations:
;
; xmm2[63:0]   = xmm0[31:0]  * xmm1[31:0]
; xmm2[127:64] = xmm0[95:64] * xmm1[95:64]

        vmovdqa xmm0,xmmword ptr [rdx]      ;xmm0 = a
        vmovdqa xmm1,xmmword ptr [r8]       ;xmm1 = b
        vpmuldq xmm2,xmm0,xmm1

; Shift source operands right by 4 bytes and repeat vpmuldq
        vpsrldq xmm0,xmm0,4
        vpsrldq xmm1,xmm1,4
        vpmuldq xmm3,xmm0,xmm1

; Save results
        vpextrq qword ptr [rcx],xmm2,0      ;save xmm2[63:0]
        vpextrq qword ptr [rcx+8],xmm3,0    ;save xmm3[63:0]
        vpextrq qword ptr [rcx+16],xmm2,1   ;save xmm2[127:64]
        vpextrq qword ptr [rcx+24],xmm3,1   ;save xmm3[127:64]
        ret
AvxPackedMulI32A_ endp

; extern "C" void AvxPackedMulI32B_(XmmVal*, const XmmVal* a, const XmmVal* b)

AvxPackedMulI32B_ proc

; Perform packed signed integer multiplication and save low packed dword result
        vmovdqa xmm0,xmmword ptr [rdx]      ;xmm0 = a
        vpmulld xmm1,xmm0,xmmword ptr [r8]  ;xmm1 = packed a * b
        vmovdqa xmmword ptr [rcx],xmm1      ;save packed dword result
        ret
AvxPackedMulI32B_ endp
        end


The C++ function AvxPackedMulI16 contains code that initializes XmmVal variables a and b using signed
16-bit integers. This function then invokes the assembly language function AvxPackedMulI16_, which
performs packed multiplication using signed 16-bit integers. The results are then streamed to cout. Note
that the results displayed by function AvxPackedMulI16 are signed 32-bit integer products. The other two
C++ functions in Listing 7-3, AvxPackedMulI32A and AvxPackedMulI32B, initialize test cases for performing
packed signed 32-bit integer multiplication. The former of these functions computes a packed signed 64-bit
integer product, while the latter calculates a packed signed 32-bit integer product.

The assembly language function AvxPackedMulI16_ begins with two vmovdqa instructions that
load argument values a and b into registers XMM0 and XMM1, respectively. The ensuing vpmullw
xmm2,xmm0,xmm1 instruction multiplies the packed signed 16-bit integers in XMM0 and XMM1 and saves
the low-order 16 bits of each 32-bit product in XMM2. This is followed by a vpmulhw xmm3,xmm0,xmm1
instruction that calculates and saves the high-order 16 bits of each 32-bit product. The next two instructions,
vpunpcklwd xmm4,xmm2,xmm3 and vpunpckhwd xmm5,xmm2,xmm3, create the final packed 32-bit signed
integer products by interleaving the low-order (vpunpcklwd) or high-order (vpunpckhwd) words of their
source operands. Figure 7-1 illustrates the instruction sequence that’s employed by AvxPackedMulI16_.


[Figure: initial values of XMM0 and XMM1, followed by the contents of XMM2 after vpmullw (low-order products), XMM3 after vpmulhw (high-order products), XMM4 after vpunpcklwd (low words unpacked to dwords), and XMM5 after vpunpckhwd (high words unpacked to dwords)]


Figure 7-1. Instruction sequence used in AvxPackedMulI16_ to perform packed 16-bit signed integer
multiplication
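
The low/high split and the subsequent interleave can be checked with a scalar sketch. The names MulLow16, MulHigh16, and PackedMulI16 are hypothetical and model the per-element effect of vpmullw, vpmulhw, and the two unpack instructions; the sketch assumes a two's complement platform for the final conversion to int32_t.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Low-order 16 bits of the signed 32-bit product (what vpmullw keeps per element).
uint16_t MulLow16(int16_t a, int16_t b)
{
    return (uint16_t)((int32_t)a * b);
}

// High-order 16 bits of the signed 32-bit product (what vpmulhw keeps per element).
uint16_t MulHigh16(int16_t a, int16_t b)
{
    return (uint16_t)((uint32_t)((int32_t)a * b) >> 16);
}

// Recombine the two halves into full 32-bit products, which is the net effect
// of the vpunpcklwd/vpunpckhwd interleave in the assembly language function.
void PackedMulI16(const int16_t* a, const int16_t* b, int32_t* c, size_t n)
{
    assert(a != nullptr && b != nullptr && c != nullptr);

    for (size_t i = 0; i < n; i++)
    {
        uint32_t lo = MulLow16(a[i], b[i]);
        uint32_t hi = MulHigh16(a[i], b[i]);
        c[i] = (int32_t)((hi << 16) | lo);
    }
}
```

For the element pair -2000 and -9000, the low half is 0xA880 and the high half is 0x0112, which combine to 0x0112A880 (18000000).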


The next assembly language function in Listing 7-3, AvxPackedMulI32A_, performs packed signed 32-bit
integer multiplication. This function begins with two vmovdqa instructions that load XmmVal argument
values a and b into registers XMM0 and XMM1, respectively. The vpmuldq xmm2,xmm0,xmm1 instruction
that follows performs packed signed 32-bit multiplication using the even-numbered elements of the two
source operands. It then saves the signed 64-bit products in XMM2. Two vpsrldq instructions are then used
to right shift by four bytes the contents of registers XMM0 and XMM1. This is followed by another vpmuldq
instruction that calculates the remaining 64-bit products. Figure 7-2 shows the execution details of this
instruction sequence.


[Figure: initial values of XMM0 and XMM1, the 64-bit products in XMM2 after the first vpmuldq, the contents of XMM0 and XMM1 after each vpsrldq (4-byte right shift), and the 64-bit products in XMM3 after the second vpmuldq]


Figure 7-2. Execution of vpmuldq and vpsrldq instructions


Following the execution of the second vpmuldq instruction, registers XMM2 and XMM3 contain the
four signed 64-bit products. These values are then saved to the specified destination buffer using a series
of vpextrq (Extract Quadword) instructions. This instruction copies the quadword element that's specified
by the immediate (or second source) operand from the first source operand and saves it to the destination
operand. For example, the instruction vpextrq qword ptr [rcx],xmm2,0 saves the low-order quadword of
XMM2 to the memory location specified by RCX. The first source operand of a vpextrq instruction must be
an XMM register; the destination operand can be a general-purpose register or a memory location. AVX also
includes instructions that you can use to extract byte (vpextrb), word (vpextrw), or doubleword (vpextrd)
elements.
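
The even-element multiply, shift, and store sequence can be modeled in scalar C++ as follows. The helper names MulEvenI32, ShiftRightOneDword, and PackedMulI32A are hypothetical; they mimic vpmuldq, vpsrldq with a shift count of 4, and the four vpextrq stores, respectively.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using I32x4 = std::array<int32_t, 4>;

// Multiply the even-numbered doublewords (elements 0 and 2), as vpmuldq does.
std::array<int64_t, 2> MulEvenI32(const I32x4& a, const I32x4& b)
{
    return {{ (int64_t)a[0] * b[0], (int64_t)a[2] * b[2] }};
}

// Shift the 128-bit value right by 4 bytes with zero fill, as vpsrldq x,x,4 does.
I32x4 ShiftRightOneDword(const I32x4& x)
{
    return {{ x[1], x[2], x[3], 0 }};
}

// Compute all four 64-bit products and store them in element order.
void PackedMulI32A(const I32x4& a, const I32x4& b, int64_t* c)
{
    assert(c != nullptr);

    std::array<int64_t, 2> even = MulEvenI32(a, b);
    std::array<int64_t, 2> odd = MulEvenI32(ShiftRightOneDword(a), ShiftRightOneDword(b));

    c[0] = even[0];     // a[0] * b[0]
    c[1] = odd[0];      // a[1] * b[1]
    c[2] = even[1];     // a[2] * b[2]
    c[3] = odd[1];      // a[3] * b[3]
}
```

The store order mirrors the assembly language function, which alternates between XMM2 and XMM3 so that the products land in memory in the same order as the source elements.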

The final assembly language function in this source code example is named AvxPackedMulI32B_. This
function also performs packed signed 32-bit integer multiplication but saves truncated 32-bit products.
Function AvxPackedMulI32B_ uses the vpmulld instruction, which performs element-by-element doubleword
multiplication similar to packed addition or subtraction. The low-order 32 bits of each product are then
saved to the destination operand. The results for source code example Ch07_03 are as follows:


Results for AvxPackedMulI16
a[0]:       10 b[0]:       -5 c[0][0]:          -50
a[1]:     3000 b[1]:      100 c[0][1]:       300000
a[2]:    -2000 b[2]:    -9000 c[0][2]:     18000000
a[3]:       42 b[3]:     1000 c[0][3]:        42000
a[4]:    -5000 b[4]:    25000 c[1][0]:   -125000000
a[5]:        8 b[5]:    16384 c[1][1]:       131072
a[6]:    10000 b[6]:     3500 c[1][2]:     35000000
a[7]:      -60 b[7]:     6000 c[1][3]:      -360000

Results for AvxPackedMulI32A
a[0]:         10 b[0]:       -500 c[0][0]:          -5000
a[1]:       3000 b[1]:        100 c[0][1]:         300000
a[2]:     -40000 b[2]:    -120000 c[1][0]:     4800000000
a[3]:       4200 b[3]:       1000 c[1][1]:        4200000

Results for AvxPackedMulI32B
a[0]:         10 b[0]:       -500 c[0]:      -5000
a[1]:       3000 b[1]:        100 c[1]:     300000
a[2]:      -2000 b[2]:     -12000 c[2]:   24000000
a[3]:       4200 b[3]:       1000 c[3]:    4200000


Packed Integer Image Processing 


The source code examples presented thus far were intended to familiarize you with AVX packed integer 
programming. Each example included a simple assembly language function that demonstrated the 
operation of several AVX instructions using instances of the structure XmmVal. For some real-world 
application programs, it may be appropriate to create a small set of functions similar to the ones you've seen 
thus far. However, in order to fully exploit the benefits of AVX, you need to code functions that implement
complete algorithms using common data structures.

The source code examples in this section present algorithms that process arrays of unsigned 8-bit 
integers using the AVX instruction set. In the first example, you learn how to determine the minimum 
and maximum value of an array. This sample program has a certain practicality to it since digital images 
often use arrays of unsigned 8-bit integers to represent images in memory, and many image-processing 
algorithms (e.g., contrast enhancement) often need to determine the minimum (darkest) and maximum 
(lightest) pixels in an image. The second sample program illustrates how to calculate the mean value of 
an array of unsigned 8-bit integers. This is another example of a realistic algorithm that is directly relevant 
to the province of image processing. The final three source code examples implement universal image 
processing algorithms, including pixel conversion, histogram creation, and thresholding. 


Pixel Minimum-Maximum Values 


Source code example Ch07_04, shown in Listing 7-4, demonstrates how to find the minimum and maximum
values in an array of unsigned 8-bit integers. This example also explains how to dynamically allocate aligned
storage space for an array.


Listing 7-4. Example Ch07_04


#pragma once
#include <cstdint>
#include <malloc.h>
#include <stdexcept>

class AlignedMem
{
public:
    static void* Allocate(size_t mem_size, size_t mem_alignment)
    {
        void* p = _aligned_malloc(mem_size, mem_alignment);

        if (p == NULL)
            throw std::runtime_error("Memory allocation error: AllocateAlignedMem()");

        return p;
    }

    static void Release(void* p)
    {
        _aligned_free(p);
    }

    template <typename T> static bool IsAligned(const T* p, size_t alignment)
    {
        if (p == nullptr)
            return false;

        if (((uintptr_t)p % alignment) != 0)
            return false;

        return true;
    }
};

template <class T> class AlignedArray
{
    T* m_Data;
    size_t m_Size;

public:
    AlignedArray(void) = delete;
    AlignedArray(const AlignedArray& aa) = delete;
    AlignedArray(AlignedArray&& aa) = delete;
    AlignedArray& operator = (const AlignedArray& aa) = delete;
    AlignedArray& operator = (AlignedArray&& aa) = delete;

    AlignedArray(size_t size, size_t alignment)
    {
        m_Size = size;
        m_Data = (T*)AlignedMem::Allocate(size * sizeof(T), alignment);
    }

    ~AlignedArray()
    {
        AlignedMem::Release(m_Data);
    }

    T* Data(void)               { return m_Data; }
    size_t Size(void)           { return m_Size; }

    void Fill(T val)
    {
        for (size_t i = 0; i < m_Size; i++)
            m_Data[i] = val;
    }
};
//------------------------------------------------
//               Ch07_04.h
//------------------------------------------------

#pragma once
#include <cstdint>

// Ch07_04.cpp
extern void Init(uint8_t* x, size_t n, unsigned int seed);
extern bool AvxCalcMinMaxU8Cpp(const uint8_t* x, size_t n, uint8_t* x_min, uint8_t* x_max);

// Ch07_04_BM.cpp
extern void AvxCalcMinMaxU8_BM(void);

// Ch07_04_.asm
extern "C" bool AvxCalcMinMaxU8_(const uint8_t* x, size_t n, uint8_t* x_min, uint8_t* x_max);

// c_NumElements must be > 0 and even multiple of 64
const size_t c_NumElements = 16 * 1024 * 1024;
const unsigned int c_RngSeedVal = 23;


#include "stdafx.h"
#include <iostream>
#include <cstdint>
#include <random>
#include "Ch07_04.h"
#include "AlignedMem.h"

using namespace std;

void Init(uint8_t* x, size_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {5, 250};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
        x[i] = (uint8_t)ui_dist(rng);

    // Use known values for min & max (for test purposes)
    x[(n / 4) * 3 + 1] = 2;
    x[n / 4 + 11] = 3;
    x[n / 2] = 252;
    x[n / 2 + 13] = 253;
    x[n / 8 + 5] = 4;
    x[n / 8 + 7] = 254;
}

bool AvxCalcMinMaxU8Cpp(const uint8_t* x, size_t n, uint8_t* x_min, uint8_t* x_max)
{
    if (n == 0 || (n & 0x3f) != 0)
        return false;

    if (!AlignedMem::IsAligned(x, 16))
        return false;

    uint8_t x_min_temp = 0xff;
    uint8_t x_max_temp = 0;

    for (size_t i = 0; i < n; i++)
    {
        uint8_t val = *x++;

        // Test both bounds independently; an "else if" would miss a maximum
        // that appears before any larger-than-minimum value is seen.
        if (val < x_min_temp)
            x_min_temp = val;
        if (val > x_max_temp)
            x_max_temp = val;
    }

    *x_min = x_min_temp;
    *x_max = x_max_temp;
    return true;
}

void AvxCalcMinMaxU8()
{
    size_t n = c_NumElements;
    AlignedArray<uint8_t> x_aa(n, 16);
    uint8_t* x = x_aa.Data();

    Init(x, n, c_RngSeedVal);

    uint8_t x_min1 = 0, x_max1 = 0;
    uint8_t x_min2 = 0, x_max2 = 0;

    bool rc1 = AvxCalcMinMaxU8Cpp(x, n, &x_min1, &x_max1);
    bool rc2 = AvxCalcMinMaxU8_(x, n, &x_min2, &x_max2);

    cout << "\nResults for AvxCalcMinMaxU8\n";
    cout << "rc1: " << rc1 << "  x_min1: " << (int)x_min1;
    cout << "  x_max1: " << (int)x_max1 << '\n';
    cout << "rc2: " << rc2 << "  x_min2: " << (int)x_min2;
    cout << "  x_max2: " << (int)x_max2 << '\n';
}

int main()
{
    AvxCalcMinMaxU8();
    AvxCalcMinMaxU8_BM();
    return 0;
}
;-------------------------------------------------
;               Ch07_04_.asm
;-------------------------------------------------

; extern "C" bool AvxCalcMinMaxU8_(uint8_t* x, size_t n, uint8_t* x_min, uint8_t* x_max)
;
; Returns:      0 = invalid n or unaligned array, 1 = success

            .const
            align 16
StartMinVal qword 0ffffffffffffffffh        ;Initial packed min values
            qword 0ffffffffffffffffh
StartMaxVal qword 0000000000000000h         ;Initial packed max values
            qword 0000000000000000h

        .code
AvxCalcMinMaxU8_ proc

; Make sure 'n' is valid
        xor eax,eax                         ;set error return code
        or rdx,rdx                          ;is n == 0?
        jz Done                             ;jump if yes
        test rdx,3fh                        ;is n a multiple of 64?
        jnz Done                            ;jump if no
        test rcx,0fh                        ;is x properly aligned?
        jnz Done                            ;jump if no

; Initialize packed min-max values
        vmovdqa xmm2,xmmword ptr [StartMinVal]
        vmovdqa xmm3,xmm2                   ;xmm3:xmm2 = packed min values
        vmovdqa xmm4,xmmword ptr [StartMaxVal]
        vmovdqa xmm5,xmm4                   ;xmm5:xmm4 = packed max values

; Scan array for min & max values
@@:     vmovdqa xmm0,xmmword ptr [rcx]      ;xmm0 = x[i + 15] : x[i]
        vmovdqa xmm1,xmmword ptr [rcx+16]   ;xmm1 = x[i + 31] : x[i + 16]
        vpminub xmm2,xmm2,xmm0
        vpminub xmm3,xmm3,xmm1              ;xmm3:xmm2 = updated min values
        vpmaxub xmm4,xmm4,xmm0
        vpmaxub xmm5,xmm5,xmm1              ;xmm5:xmm4 = updated max values

        vmovdqa xmm0,xmmword ptr [rcx+32]   ;xmm0 = x[i + 47] : x[i + 32]
        vmovdqa xmm1,xmmword ptr [rcx+48]   ;xmm1 = x[i + 63] : x[i + 48]
        vpminub xmm2,xmm2,xmm0
        vpminub xmm3,xmm3,xmm1              ;xmm3:xmm2 = updated min values
        vpmaxub xmm4,xmm4,xmm0
        vpmaxub xmm5,xmm5,xmm1              ;xmm5:xmm4 = updated max values

        add rcx,64
        sub rdx,64
        jnz @B

; Determine final minimum value
        vpminub xmm0,xmm2,xmm3              ;xmm0[127:0] = final 16 min vals
        vpsrldq xmm1,xmm0,8                 ;xmm1[63:0] = xmm0[127:64]
        vpminub xmm2,xmm1,xmm0              ;xmm2[63:0] = final 8 min vals
        vpsrldq xmm3,xmm2,4                 ;xmm3[31:0] = xmm2[63:32]
        vpminub xmm0,xmm3,xmm2              ;xmm0[31:0] = final 4 min vals
        vpsrldq xmm1,xmm0,2                 ;xmm1[15:0] = xmm0[31:16]
        vpminub xmm2,xmm1,xmm0              ;xmm2[15:0] = final 2 min vals
        vpextrw eax,xmm2,0                  ;ax = final 2 min vals
        cmp al,ah
        jbe @F                              ;jump if al <= ah
        mov al,ah                           ;al = final min value
@@:     mov [r8],al                         ;save final min

; Determine final maximum value
        vpmaxub xmm0,xmm4,xmm5              ;xmm0[127:0] = final 16 max vals
        vpsrldq xmm1,xmm0,8                 ;xmm1[63:0] = xmm0[127:64]
        vpmaxub xmm2,xmm1,xmm0              ;xmm2[63:0] = final 8 max vals
        vpsrldq xmm3,xmm2,4                 ;xmm3[31:0] = xmm2[63:32]
        vpmaxub xmm0,xmm3,xmm2              ;xmm0[31:0] = final 4 max vals
        vpsrldq xmm1,xmm0,2                 ;xmm1[15:0] = xmm0[31:16]
        vpmaxub xmm2,xmm1,xmm0              ;xmm2[15:0] = final 2 max vals
        vpextrw eax,xmm2,0                  ;ax = final 2 max vals
        cmp al,ah
        jae @F                              ;jump if al >= ah
        mov al,ah                           ;al = final max value
@@:     mov [r9],al                         ;save final max

        mov eax,1                           ;set success return code
Done:   ret
AvxCalcMinMaxU8_ endp
        end


Listing 7-4 begins with the source code for the header file AlignedMem.h. This file defines a couple
of simple C++ classes that facilitate dynamically allocated aligned arrays. The class AlignedMem is a basic
wrapper class for the Visual C++ runtime functions _aligned_malloc and _aligned_free. This class also
includes a template member function named AlignedMem::IsAligned that validates the alignment of an
array in memory. The header file AlignedMem.h also defines a template class named AlignedArray. Class
AlignedArray, which is used in this and subsequent source code examples, contains code that implements
and manages dynamically allocated aligned arrays. Note that this class contains only minimal functionality
to support the source code examples in this book, which is why many of the standard constructors and
assignment operators are disabled.
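
For readers who are not building with Visual C++, the same allocate, release, and alignment-check pattern can be sketched with the aligned forms of operator new and operator delete that were added in C++17. This is a substitute technique, not the book's AlignedMem class; the function names AllocateAligned, ReleaseAligned, and IsAligned are hypothetical, and the sketch assumes a C++17 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Allocates mem_size bytes on a mem_alignment boundary (must be a power of two).
// Throws std::bad_alloc on failure instead of returning a null pointer.
void* AllocateAligned(size_t mem_size, size_t mem_alignment)
{
    assert(mem_alignment != 0 && (mem_alignment & (mem_alignment - 1)) == 0);
    return ::operator new(mem_size, static_cast<std::align_val_t>(mem_alignment));
}

// Memory obtained from AllocateAligned must be released with the same alignment.
void ReleaseAligned(void* p, size_t mem_alignment)
{
    ::operator delete(p, static_cast<std::align_val_t>(mem_alignment));
}

// Same test that AlignedMem::IsAligned performs.
bool IsAligned(const void* p, size_t alignment)
{
    return p != nullptr && ((uintptr_t)p % alignment) == 0;
}
```

A buffer obtained from AllocateAligned(n, 16) satisfies the 16-byte alignment requirement of the vmovdqa instruction used in this chapter's assembly language functions.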

The primary C++ code in example Ch07_04 begins with the definition of a function named Init. This
function initializes an array of unsigned 8-bit integers with random values in order to simulate the pixel
values of an image. Function Init uses the C++ standard template library (STL) classes
uniform_int_distribution and default_random_engine to generate random values for the array. Appendix A contains
a list of references that you can consult if you're interested in learning more about these classes. Note that
function Init sets some of the pixel values in the target array to known values for test purposes.

The function AvxCalcMinMaxU8Cpp implements a C++ version of the pixel value min-max algorithm.
Parameters for this function include a pointer to the array, the number of array elements, and pointers for
the minimum and maximum values. The algorithm itself consists of an unsophisticated for loop that sweeps
through the array to find the minimum and maximum pixel values. Note that function AvxCalcMinMaxU8Cpp
(and its counterpart assembly language function AvxCalcMinMaxU8_) requires the size of the array to be
an even multiple of 64. The reason for this is that the assembly language function AvxCalcMinMaxU8_
(arbitrarily) processes 64 pixels during each loop iteration, as you'll soon see. Also note that the source pixel
array must be aligned to a 16-byte boundary. The C++ template function AlignedMem::IsAligned performs
this check.

The C++ function AvxCalcMinMaxU8 contains code that initializes a test array and exercises the two
pixel min-max functions. This function uses the aforementioned template class named AlignedArray
to dynamically allocate an array of unsigned 8-bit integers that's aligned to a 16-byte boundary. The
constructor arguments for this class include the number of array elements and the alignment boundary.
Following the AlignedArray<uint8_t> x_aa(n, 16) statement, AvxCalcMinMaxU8 obtains a raw C++
pointer to the array buffer using the member function AlignedArray::Data(). This pointer is passed as an
argument to the two min-max functions.

The assembly language function AvxCalcMinMaxU8_ implements the same algorithm as its C++
counterpart with one significant difference. It processes array elements using 16-byte packets, which is
the maximum number of unsigned 8-bit integers that can be stored in an XMM register. The function
AvxCalcMinMaxU8_ begins by validating the size of argument n. It then checks array x for proper alignment.
Following argument validation, AvxCalcMinMaxU8_ loads register pairs XMM3:XMM2 and XMM5:XMM4
with the initial packed minimum and maximum values, respectively. This enables the processing loop to
track 32 min-max values simultaneously.

During each processing loop iteration, the function AvxCalcMinMaxU8_ loads 32 pixel values into
register pair XMM1:XMM0 using the instructions vmovdqa xmm0,xmmword ptr [rcx] and vmovdqa
xmm1,xmmword ptr [rcx+16]. The next two instructions, vpminub xmm2,xmm2,xmm0 and vpminub
xmm3,xmm3,xmm1, update the current pixel minimums in register pair XMM3:XMM2. The ensuing vpmaxub
instructions update the current pixel maximums in register pair XMM5:XMM4. Another sequence of
vmovdqa, vpminub, and vpmaxub instructions handles the next group of 32 pixels. The processing of multiple
data items during each loop iteration reduces the number of executed jump instructions and often results
in faster code. This optimization technique is commonly called loop unrolling (or unwinding). You'll learn
more about loop unrolling and jump instruction optimization techniques in Chapter 15.
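
The idea of tracking several running minimums and maximums in parallel can be sketched in portable C++. The function name CalcMinMaxU8Lanes is hypothetical, and the sketch is a simplification: it keeps 16 lanes (one XMM register's worth) instead of the 32 that the assembly language function tracks, and it does not unroll the loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of the packed min-max loop: lane j only ever sees elements
// x[j], x[j + 16], x[j + 32], ... just as byte j of an XMM register would.
bool CalcMinMaxU8Lanes(const uint8_t* x, size_t n, uint8_t* x_min, uint8_t* x_max)
{
    assert(x_min != nullptr && x_max != nullptr);

    const size_t num_lanes = 16;

    if (x == nullptr || n == 0 || (n % 64) != 0)
        return false;

    uint8_t lane_min[num_lanes];
    uint8_t lane_max[num_lanes];
    std::fill(lane_min, lane_min + num_lanes, (uint8_t)0xff);   // like StartMinVal
    std::fill(lane_max, lane_max + num_lanes, (uint8_t)0x00);   // like StartMaxVal

    for (size_t i = 0; i < n; i += num_lanes)
    {
        for (size_t j = 0; j < num_lanes; j++)
        {
            lane_min[j] = std::min(lane_min[j], x[i + j]);      // vpminub
            lane_max[j] = std::max(lane_max[j], x[i + j]);      // vpmaxub
        }
    }

    // The per-lane results still have to be reduced to single values.
    *x_min = *std::min_element(lane_min, lane_min + num_lanes);
    *x_max = *std::max_element(lane_max, lane_max + num_lanes);
    return true;
}
```

Because minimum and maximum are associative and commutative, the order in which elements are assigned to lanes does not affect the final result.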

Following the execution of the pixel value min-max processing loop, the values in register pairs
XMM3:XMM2 and XMM5:XMM4 must be reduced in order to obtain the final minimum and maximum
values. The vpminub xmm0,xmm2,xmm3 instruction reduces the number of pixel minimum values from 32 to
16. The next instruction, vpsrldq xmm1,xmm0,8, right shifts the contents of XMM0 by eight bytes and saves
the result in register XMM1 (i.e., XMM1[63:0] = XMM0[127:64]). This facilitates the use of the subsequent
vpminub xmm2,xmm1,xmm0 instruction that reduces the number of minimum values from 16 to 8. Two more
vpsrldq-vpminub instruction sequences are then employed to reduce the number of pixel minimums to two
as shown in Figure 7-3. The vpextrw eax,xmm2,0 instruction extracts the low-order word (XMM2[15:0]) from
register XMM2 and saves it to the low-order word of register EAX (or register AX). The cmp al,ah, jbe, and mov
al,ah instructions determine the final pixel minimum value. AvxCalcMinMaxU8_ uses a similar reduction
technique to determine the maximum pixel value.
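
The halving reduction described above can be expressed as a short loop. ReduceMinU8 is a hypothetical name; each pass of the outer loop corresponds to one shift-and-minimum step (widths 8, 4, and 2 match the three vpsrldq-vpminub pairs, and the final width of 1 plays the role of the cmp al,ah comparison).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reduces up to 16 packed minimums to a single value by repeatedly folding
// the upper half of the active elements onto the lower half.
uint8_t ReduceMinU8(const uint8_t* lanes, size_t n)
{
    // n must be a power of two no larger than 16
    assert(lanes != nullptr && n >= 1 && n <= 16 && (n & (n - 1)) == 0);

    uint8_t v[16];
    std::copy(lanes, lanes + n, v);

    for (size_t width = n / 2; width >= 1; width /= 2)
    {
        for (size_t i = 0; i < width; i++)
            v[i] = std::min(v[i], v[i + width]);    // fold upper half onto lower half
    }

    return v[0];
}
```

Replacing std::min with std::max yields the corresponding maximum reduction.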


[Figure: initial values of XMM2 and XMM3, followed by the register contents after each step of the sequence vpminub xmm0,xmm2,xmm3; vpsrldq xmm1,xmm0,8; vpminub xmm2,xmm1,xmm0; vpsrldq xmm3,xmm2,4; vpminub xmm0,xmm3,xmm2; vpsrldq xmm1,xmm0,2; vpminub xmm2,xmm1,xmm0. Shaded elements denote don't care values.]


Figure 7-3. Reduction of pixel minimum values using the instructions vpminub and vpsrldq

Here is the output for source code example Ch07_04:


Results for AvxCalcMinMaxU8
rc1: 1  x_min1: 2  x_max1: 254
rc2: 1  x_min2: 2  x_max2: 254

Running benchmark function AvxCalcMinMaxU8_BM - please wait
Benchmark times save to file Ch07_04_AvxCalcMinMaxU8_BM_CHROMIUM.csv


Table 7-1 shows some timing measurements for the functions AvxCalcMinMaxU8Cpp and AvxCalcMinMaxU8_
using several different Intel processors. These measurements were made using the procedure that's
described in Chapter 6. The benchmark source code for this and subsequent examples is not shown but
is included with the chapter download packages.


Table 7-1. Pixel Value Min-Max Mean Execution Times (Microseconds), Array Size = 16 MB

CPU         AvxCalcMinMaxU8Cpp    AvxCalcMinMaxU8_
i7-4790S    17642                 1007
i9-7900X    13638                 874
i7-8700K    12622                 721


Pixel Mean Intensity 


The next source code example, Ch07_05, contains code that calculates the arithmetic mean of an array of 
8-bit unsigned integers. This example also illustrates how to size-promote packed unsigned integers. 
Listing 7-5 shows the source code for example Ch07_05. 


Listing 7-5. Example Ch07_05 


#pragma once
#include <cstdint>

// Ch07_05.cpp
extern void Init(uint8_t* x, size_t n, unsigned int seed);
extern bool AvxCalcMeanU8Cpp(const uint8_t* x, size_t n, int64_t* sum_x, double* mean);

// Ch07_05_BM.cpp
extern void AvxCalcMeanU8_BM(void);

// Ch07_05_.asm
extern "C" bool AvxCalcMeanU8_(const uint8_t* x, size_t n, int64_t* sum_x, double* mean);

// Common constants
const size_t c_NumElements = 16 * 1024 * 1024;      // Must be multiple of 64
const size_t c_NumElementsMax = 64 * 1024 * 1024;   // Used to avoid overflows
const unsigned int c_RngSeedVal = 29;


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>
#include "Ch07_05.h"
#include "AlignedMem.h"

using namespace std;

extern "C" size_t g_NumElementsMax = c_NumElementsMax;     // Used in .asm code

void Init(uint8_t* x, size_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {0, 255};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
        x[i] = (uint8_t)ui_dist(rng);
}

bool AvxCalcMeanU8Cpp(const uint8_t* x, size_t n, int64_t* sum_x, double* mean_x)
{
    if (n == 0 || n > c_NumElementsMax)
        return false;

    if ((n % 64) != 0)
        return false;

    if (!AlignedMem::IsAligned(x, 16))
        return false;

    int64_t sum_x_temp = 0;

    for (size_t i = 0; i < n; i++)
        sum_x_temp += x[i];

    *sum_x = sum_x_temp;
    *mean_x = (double)sum_x_temp / n;
    return true;
}

void AvxCalcMeanU8()
{
    const size_t n = c_NumElements;
    AlignedArray<uint8_t> x_aa(n, 16);
    uint8_t* x = x_aa.Data();

    Init(x, n, c_RngSeedVal);

    bool rc1, rc2;
    int64_t sum_x1 = -1, sum_x2 = -1;
    double mean_x1 = -1, mean_x2 = -1;

    rc1 = AvxCalcMeanU8Cpp(x, n, &sum_x1, &mean_x1);
    rc2 = AvxCalcMeanU8_(x, n, &sum_x2, &mean_x2);

    cout << "\nResults for MmxCalcMeanU8\n";
    cout << fixed << setprecision(6);
    cout << "rc1: " << rc1 << "  ";
    cout << "sum_x1: " << sum_x1 << "  ";
    cout << "mean_x1: " << mean_x1 << '\n';
    cout << "rc2: " << rc2 << "  ";
    cout << "sum_x2: " << sum_x2 << "  ";
    cout << "mean_x2: " << mean_x2 << '\n';
}

int main()
{
    AvxCalcMeanU8();
    AvxCalcMeanU8_BM();
    return 0;
}
;-------------------------------------------------
;               Ch07_05.asm
;-------------------------------------------------

        include <MacrosX86-64-AVX.asmh>
        extern g_NumElementsMax:qword

; extern "C" bool AvxCalcMeanU8_(const Uint8* x, size_t n, int64_t* sum_x, double* mean);
;
; Returns       0 = invalid n or unaligned array, 1 = success

        .code
AvxCalcMeanU8_ proc frame
        _CreateFrame CM_,0,64
        _SaveXmmRegs xmm6,xmm7,xmm8,xmm9
        _EndProlog

; Verify function arguments
        xor eax,eax                         ;set error return code
        or rdx,rdx
        jz Done                             ;jump if n == 0

        cmp rdx,[g_NumElementsMax]
        jae Done                            ;jump if n >= NumElementsMax

        test rdx,3fh
        jnz Done                            ;jump if (n % 64) != 0

        test rcx,0fh
        jnz Done                            ;jump if x is not properly aligned

; Perform required initializations
        mov r10,rdx                         ;save n for later use
        add rdx,rcx                         ;rdx = end of array
        vpxor xmm8,xmm8,xmm8                ;xmm8 = packed intermediate sums (4 dwords)
        vpxor xmm9,xmm9,xmm9                ;xmm9 = packed zero for promotions

; Promote 32 pixel values from bytes to words, then sum the words
@@:     vmovdqa xmm0,xmmword ptr [rcx]
        vmovdqa xmm1,xmmword ptr [rcx+16]   ;xmm1:xmm0 = 32 pixels
        vpunpcklbw xmm2,xmm0,xmm9           ;xmm2 = 8 words
        vpunpckhbw xmm3,xmm0,xmm9           ;xmm3 = 8 words
        vpunpcklbw xmm4,xmm1,xmm9           ;xmm4 = 8 words
        vpunpckhbw xmm5,xmm1,xmm9           ;xmm5 = 8 words
        vpaddw xmm0,xmm2,xmm3
        vpaddw xmm1,xmm4,xmm5
        vpaddw xmm6,xmm0,xmm1               ;xmm6 = packed sums (8 words)

; Promote another 32 pixel values from bytes to words, then sum the words
        vmovdqa xmm0,xmmword ptr [rcx+32]
        vmovdqa xmm1,xmmword ptr [rcx+48]   ;xmm1:xmm0 = 32 pixels
        vpunpcklbw xmm2,xmm0,xmm9           ;xmm2 = 8 words
        vpunpckhbw xmm3,xmm0,xmm9           ;xmm3 = 8 words
        vpunpcklbw xmm4,xmm1,xmm9           ;xmm4 = 8 words
        vpunpckhbw xmm5,xmm1,xmm9           ;xmm5 = 8 words
        vpaddw xmm0,xmm2,xmm3
        vpaddw xmm1,xmm4,xmm5
        vpaddw xmm7,xmm0,xmm1               ;xmm7 = packed sums (8 words)

; Promote packed sums to dwords, then update dword sums
        vpaddw xmm0,xmm6,xmm7               ;xmm0 = packed sums (8 words)
        vpunpcklwd xmm1,xmm0,xmm9           ;xmm1 = packed sums (4 dwords)
        vpunpckhwd xmm2,xmm0,xmm9           ;xmm2 = packed sums (4 dwords)
        vpaddd xmm8,xmm8,xmm1
        vpaddd xmm8,xmm8,xmm2

        add rcx,64                          ;rcx = next 64 byte block
        cmp rcx,rdx
        jne @B                              ;repeat loop if not done

; Compute final sum_x (note vpextrd zero extends extracted dword to 64 bits)
        vpextrd eax,xmm8,0                  ;rax = partial sum 0
        vpextrd edx,xmm8,1                  ;rdx = partial sum 1
        add rax,rdx
        vpextrd ecx,xmm8,2                  ;rcx = partial sum 2
        vpextrd edx,xmm8,3                  ;rdx = partial sum 3
        add rax,rcx
        add rax,rdx
        mov [r8],rax                        ;save sum_x

; Compute mean value
        vcvtsi2sd xmm0,xmm0,rax             ;xmm0 = sum_x (DPFP)
        vcvtsi2sd xmm1,xmm1,r10             ;xmm1 = n (DPFP)
        vdivsd xmm2,xmm0,xmm1               ;calc mean = sum_x / n
        vmovsd real8 ptr [r9],xmm2          ;save mean

        mov eax,1                           ;set success return code

Done:   _RestoreXmmRegs xmm6,xmm7,xmm8,xmm9
        _DeleteFrame
        ret
AvxCalcMeanU8_ endp
        end


The organization of the C++ code in example Ch07_05 is somewhat similar to the previous example. The
C++ function AvxCalcMeanU8Cpp uses a simple summing loop and scalar arithmetic to calculate the mean
of an array of 8-bit unsigned integers. Like the previous example, the number of array elements must be an
integral multiple of 64 and the source array must be aligned to a 16-byte boundary. Note that the function
AvxCalcMeanU8Cpp also verifies that the number of array elements is not greater than c_NumElementsMax.
This size restriction enables the assembly language function AvxCalcMeanU8_ to carry out its calculations
using packed doublewords without any safeguards for arithmetic overflows. The remaining C++ code that's
shown in Listing 7-5 performs test array initialization and streams results to cout.
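The overflow argument can be checked with a little arithmetic. The following is a minimal sketch, assuming that each of the four doubleword lanes receives one quarter of the input bytes (which is how the loop in Listing 7-5 distributes each 64-byte block). The names MaxLaneSum, kNumElementsMax, and kU32Max are hypothetical and are not part of the book's source code.

```cpp
#include <cassert>
#include <cstdint>

// Worst-case value of one 32-bit lane when n bytes (n a multiple of 64)
// are spread evenly over four doubleword lanes and every byte equals 255.
constexpr uint64_t MaxLaneSum(uint64_t n)
{
    return (n / 4) * 255;
}

constexpr uint64_t kNumElementsMax = 64ull * 1024 * 1024;  // mirrors c_NumElementsMax
constexpr uint64_t kU32Max = 0xFFFFFFFFull;                // largest unsigned doubleword

// With the 64 MB limit the worst case still fits in a doubleword lane.
static_assert(MaxLaneSum(kNumElementsMax) <= kU32Max, "lane would overflow");

// Doubling the limit would no longer fit, so the check is not arbitrary.
static_assert(MaxLaneSum(2 * kNumElementsMax) > kU32Max, "expected overflow");
```

Under these assumptions the worst-case lane total at the limit is 4,278,190,080, which is just below the largest value an unsigned doubleword can hold.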

The assembly language function AvxCalcMeanU8_ begins by performing the same validations of the
array size as its C++ counterpart. The address of the array is also checked for proper alignment. Following
argument validation, AvxCalcMeanU8_ carries out its required initializations. The add rdx,rcx instruction
computes the address of the first byte beyond the end of the array. The function AvxCalcMeanU8_ uses this
address instead of a counter to terminate the processing loop. Register XMM8 is then initialized to all zeros.
The processing loop uses this register to maintain intermediate packed doubleword sums.
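The end-address loop structure described above can be modeled in scalar C++. This is only an illustrative sketch: the function name SumBytesByBlock is hypothetical, and the assignment of bytes to lanes (i % 4) is a simplification that does not reproduce the exact element order used by the unpack instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sums n bytes in 64-byte blocks. The loop stops when the pointer reaches
// 'end' rather than when a counter reaches zero.
inline uint64_t SumBytesByBlock(const uint8_t* x, size_t n)
{
    assert(n % 64 == 0);                // same block-size rule as the text
    const uint8_t* end = x + n;         // first byte beyond the array (add rdx,rcx)
    uint32_t lane[4] = {0, 0, 0, 0};    // stand-in for the four doublewords in XMM8

    for (; x != end; x += 64)           // cmp rcx,rdx / jne
    {
        for (size_t i = 0; i < 64; i++)
            lane[i % 4] += x[i];        // 16 bytes per lane per block
    }

    // Final reduction of the four partial sums to a 64-bit total
    return (uint64_t)lane[0] + lane[1] + lane[2] + lane[3];
}
```

Comparing a pointer against a precomputed end address avoids maintaining a separate loop counter alongside the pointer.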

Each processing loop iteration begins with two vmovdqa instructions that load 32 unsigned byte values
into registers XMM1:XMM0. The pixel values are then size-promoted to words using the vpunpcklbw
(Unpack Low Data) and vpunpckhbw (Unpack High Data) instructions. These instructions interleave the byte
values contained in the two source operands to form word values, as shown in Figure 7-4. Note that register
XMM9 contains all zeros, which means that the unsigned byte values are zero-extended during size promotion
to words. A series of vpaddw instructions then sums the packed unsigned word values. The function
AvxCalcMeanU8_ processes another block of 32 pixels using the same sequence of instructions. The unsigned
word sums in registers XMM6 and XMM7 are then summed using a vpaddw instruction, size-promoted to
doublewords using vpunpcklwd and vpunpckhwd, and added to the intermediate packed doubleword sums in
register XMM8. Figure 7-4 illustrates this instruction sequence in greater detail.


[Figure: initial register values followed by the results of vpunpcklbw xmm2,xmm0,xmm9, vpunpckhbw
xmm3,xmm0,xmm9, vpunpcklwd xmm4,xmm3,xmm9, and vpunpckhwd xmm5,xmm3,xmm9, with high-order values
on the left and low-order values on the right]


Figure 7-4. Execution of the vpunpck[h|l]bw and vpunpck[h|l]wd instructions
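The same zero-extension technique can be expressed with C++ compiler intrinsics. The following is a minimal sketch, assuming an x86 compiler that provides the SSE2 header emmintrin.h; it uses the non-VEX intrinsic counterparts of the instructions discussed above, and the function name SumU8x16 is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>   // SSE2 intrinsics

// Sums 16 unsigned bytes by size-promoting them to words and then doublewords.
inline uint32_t SumU8x16(const uint8_t* p)
{
    const __m128i zero = _mm_setzero_si128();      // plays the role of XMM9
    __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));

    // Interleaving with zero yields zero-extended words
    __m128i w_lo = _mm_unpacklo_epi8(bytes, zero); // bytes 0-7  -> 8 words
    __m128i w_hi = _mm_unpackhi_epi8(bytes, zero); // bytes 8-15 -> 8 words
    __m128i w = _mm_add_epi16(w_lo, w_hi);         // 8 word sums

    // Promote the word sums to doublewords and add them
    __m128i d = _mm_add_epi32(_mm_unpacklo_epi16(w, zero),
                              _mm_unpackhi_epi16(w, zero));

    alignas(16) uint32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), d);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Because the second operand of each unpack is all zeros, the high-order byte (or word) of every promoted element is zero, which is exactly what an unsigned size promotion requires.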


Following termination of the processing loop, the intermediate doubleword sums in register XMM8
are totaled to generate the final pixel sum. The function uses several vpextrd instructions to copy each
doubleword value from XMM8 to a general-purpose register. Note that this instruction uses an immediate
operand to specify which element value to copy. Following computation of the pixel sum, AvxCalcMeanU8_
calculates the final pixel mean using simple scalar arithmetic. Here are the results for source code example
Ch07_05:


Results for AvxCalcMeanU8
rc1: 1  sum_x1: 2139023922  mean_x1: 127.495761
rc2: 1  sum_x2: 2139023922  mean_x2: 127.495761


Running benchmark function AvxCalcMeanU8_BM - please wait
Benchmark times save to file Ch07_05_AvxCalcMeanU8_BM_CHROMIUM.csv

Table 7-2 shows some benchmark timing measurements for source code example Ch07_05.


Table 7-2. Source Code Example Ch07_05 Mean Execution Times (Microseconds), Array Size = 16 MB

CPU         AvxCalcMeanU8Cpp    AvxCalcMeanU8_
i7-4790S    7103                1063
i9-7900X    6332                1048
i7-8700K    5870                861


Pixel Conversions 


In order to implement certain image-processing algorithms, it is often necessary to convert the pixels
of an 8-bit grayscale image from unsigned integer to single-precision floating-point values and vice versa.
The sample code example of this section, Ch07_06, illustrates how to do this using the AVX instruction set.
Listing 7-6 shows the source code for example Ch07_06.


Listing 7-6. Example Ch07_06 


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <cstdint>
#include <random>
#include "AlignedMem.h"

using namespace std;

// Ch07_06_Misc.cpp
extern uint32_t ConvertImgVerify(const float* src1, const float* src2, uint32_t num_pixels);
extern uint32_t ConvertImgVerify(const uint8_t* src1, const uint8_t* src2, uint32_t num_pixels);

// Ch07_06_.asm
extern "C" bool ConvertImgU8ToF32_(float* des, const uint8_t* src, uint32_t num_pixels);
extern "C" bool ConvertImgF32ToU8_(uint8_t* des, const float* src, uint32_t num_pixels);

extern "C" uint32_t c_NumPixelsMax = 16777216;

template <typename T> void Init(T* x, size_t n, unsigned int seed, T scale)
{
    uniform_int_distribution<> ui_dist {0, 255};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
    {
        T temp = (T)ui_dist(rng);
        x[i] = (scale == 1) ? temp : temp / scale;
    }
}

bool ConvertImgU8ToF32Cpp(float* des, const uint8_t* src, uint32_t num_pixels)
{
    // Make sure num_pixels is valid
    if ((num_pixels == 0) || (num_pixels > c_NumPixelsMax))
        return false;
    if ((num_pixels % 32) != 0)
        return false;

    // Make sure src and des are aligned to a 16-byte boundary
    if (!AlignedMem::IsAligned(src, 16))
        return false;
    if (!AlignedMem::IsAligned(des, 16))
        return false;

    // Convert the image
    for (uint32_t i = 0; i < num_pixels; i++)
        des[i] = src[i] / 255.0f;

    return true;
}

bool ConvertImgF32ToU8Cpp(uint8_t* des, const float* src, uint32_t num_pixels)
{
    // Make sure num_pixels is valid
    if ((num_pixels == 0) || (num_pixels > c_NumPixelsMax))
        return false;
    if ((num_pixels % 32) != 0)
        return false;

    // Make sure src and des are aligned to a 16-byte boundary
    if (!AlignedMem::IsAligned(src, 16))
        return false;
    if (!AlignedMem::IsAligned(des, 16))
        return false;

    // Clip each pixel to [0.0, 1.0], then convert
    for (uint32_t i = 0; i < num_pixels; i++)
    {
        if (src[i] > 1.0f)
            des[i] = 255;
        else if (src[i] < 0.0)
            des[i] = 0;
        else
            des[i] = (uint8_t)(src[i] * 255.0f);
    }

    return true;
}

void ConvertImgU8ToF32(void)
{
    const uint32_t num_pixels = 1024;
    AlignedArray<uint8_t> src_aa(num_pixels, 16);
    AlignedArray<float> des1_aa(num_pixels, 16);
    AlignedArray<float> des2_aa(num_pixels, 16);
    uint8_t* src = src_aa.Data();
    float* des1 = des1_aa.Data();
    float* des2 = des2_aa.Data();

    Init(src, num_pixels, 12, (uint8_t)1);

    bool rc1 = ConvertImgU8ToF32Cpp(des1, src, num_pixels);
    bool rc2 = ConvertImgU8ToF32_(des2, src, num_pixels);

    if (!rc1 || !rc2)
    {
        cout << "Invalid return code - ";
        cout << "rc1 = " << boolalpha << rc1 << ", ";
        cout << "rc2 = " << boolalpha << rc2 << '\n';
        return;
    }

    uint32_t num_diff = ConvertImgVerify(des1, des2, num_pixels);

    cout << "\nResults for ConvertImgU8ToF32\n";
    cout << "  num_pixels = " << num_pixels << '\n';
    cout << "  num_diff = " << num_diff << '\n';
}


void ConvertImgF32ToU8(void)
{
    const uint32_t num_pixels = 1024;
    AlignedArray<float> src_aa(num_pixels, 16);
    AlignedArray<uint8_t> des1_aa(num_pixels, 16);
    AlignedArray<uint8_t> des2_aa(num_pixels, 16);
    float* src = src_aa.Data();
    uint8_t* des1 = des1_aa.Data();
    uint8_t* des2 = des2_aa.Data();

    // Initialize the src pixel buffer. The first few entries in src
    // are set to known values for test purposes.
    Init(src, num_pixels, 20, 1.0f);

    src[0] = 0.125f;        src[8] = 0.01f;
    src[1] = 0.75f;         src[9] = 0.99f;
    src[2] = -4.0f;         src[10] = 1.1f;
    src[3] = 3.0f;          src[11] = -1.1f;
    src[4] = 0.0f;          src[12] = 0.99999f;
    src[5] = 1.0f;          src[13] = 0.5f;
    src[6] = -0.01f;        src[14] = -0.0;
    src[7] = 1.01f;         src[15] = .333333f;

    bool rc1 = ConvertImgF32ToU8Cpp(des1, src, num_pixels);
    bool rc2 = ConvertImgF32ToU8_(des2, src, num_pixels);

    if (!rc1 || !rc2)
    {
        cout << "Invalid return code - ";
        cout << "rc1 = " << boolalpha << rc1 << ", ";
        cout << "rc2 = " << boolalpha << rc2 << '\n';
        return;
    }

    uint32_t num_diff = ConvertImgVerify(des1, des2, num_pixels);

    cout << "\nResults for ConvertImgF32ToU8\n";
    cout << "  num_pixels = " << num_pixels << '\n';
    cout << "  num_diff = " << num_diff << '\n';
}

int main()
{
    ConvertImgU8ToF32();
    ConvertImgF32ToU8();
    return 0;
}

;-------------------------------------------------
;               Ch07_06.asm
;-------------------------------------------------

        include <MacrosX86-64-AVX.asmh>
        include <cmpequ.asmh>

                    .const
                    align 16
Uint8ToFloat        real4 255.0, 255.0, 255.0, 255.0
FloatToUint8Min     real4 0.0, 0.0, 0.0, 0.0
FloatToUint8Max     real4 1.0, 1.0, 1.0, 1.0
FloatToUint8Scale   real4 255.0, 255.0, 255.0, 255.0

        extern c_NumPixelsMax:dword

; extern "C" bool ConvertImgU8ToF32_(float* des, const uint8_t* src, uint32_t num_pixels)

        .code
ConvertImgU8ToF32_ proc frame
        _CreateFrame U2F_,0,160
        _SaveXmmRegs xmm6,xmm7,xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15
        _EndProlog

; Make sure num_pixels is valid and pixel buffers are properly aligned
        xor eax,eax                         ;set error return code
        or r8d,r8d
        jz Done                             ;jump if num_pixels is zero
        cmp r8d,[c_NumPixelsMax]
        ja Done                             ;jump if num_pixels too big
        test r8d,1fh
        jnz Done                            ;jump if num_pixels % 32 != 0
        test rcx,0fh
        jnz Done                            ;jump if des not aligned
        test rdx,0fh
        jnz Done                            ;jump if src not aligned

; Initialize processing loop registers
        shr r8d,5                               ;number of pixel blocks
        vmovaps xmm6,xmmword ptr [Uint8ToFloat] ;xmm6 = packed 255.0f
        vpxor xmm7,xmm7,xmm7                    ;xmm7 = packed 0

; Load the next block of 32 pixels
@@:     vmovdqa xmm0,xmmword ptr [rdx]      ;xmm0 = 16 pixels (x[i+15]:x[i])
        vmovdqa xmm1,xmmword ptr [rdx+16]   ;xmm1 = 16 pixels (x[i+31]:x[i+16])

; Promote the pixel values in xmm0 from unsigned bytes to unsigned dwords
        vpunpcklbw xmm2,xmm0,xmm7
        vpunpckhbw xmm3,xmm0,xmm7
        vpunpcklwd xmm8,xmm2,xmm7
        vpunpckhwd xmm9,xmm2,xmm7
        vpunpcklwd xmm10,xmm3,xmm7
        vpunpckhwd xmm11,xmm3,xmm7          ;xmm11:xmm8 = 16 dword pixels

; Promote the pixel values in xmm1 from unsigned bytes to unsigned dwords
        vpunpcklbw xmm2,xmm1,xmm7
        vpunpckhbw xmm3,xmm1,xmm7
        vpunpcklwd xmm12,xmm2,xmm7
        vpunpckhwd xmm13,xmm2,xmm7
        vpunpcklwd xmm14,xmm3,xmm7
        vpunpckhwd xmm15,xmm3,xmm7          ;xmm15:xmm12 = 16 dword pixels

; Convert pixel values from dwords to SPFP
        vcvtdq2ps xmm8,xmm8
        vcvtdq2ps xmm9,xmm9
        vcvtdq2ps xmm10,xmm10
        vcvtdq2ps xmm11,xmm11               ;xmm11:xmm8 = 16 SPFP pixels
        vcvtdq2ps xmm12,xmm12
        vcvtdq2ps xmm13,xmm13
        vcvtdq2ps xmm14,xmm14
        vcvtdq2ps xmm15,xmm15               ;xmm15:xmm12 = 16 SPFP pixels

; Normalize all pixel values to [0.0, 1.0] and save the results
        vdivps xmm0,xmm8,xmm6
        vmovaps xmmword ptr [rcx],xmm0      ;save pixels 0 - 3
        vdivps xmm1,xmm9,xmm6
        vmovaps xmmword ptr [rcx+16],xmm1   ;save pixels 4 - 7
        vdivps xmm2,xmm10,xmm6
        vmovaps xmmword ptr [rcx+32],xmm2   ;save pixels 8 - 11
        vdivps xmm3,xmm11,xmm6
        vmovaps xmmword ptr [rcx+48],xmm3   ;save pixels 12 - 15

        vdivps xmm0,xmm12,xmm6
        vmovaps xmmword ptr [rcx+64],xmm0   ;save pixels 16 - 19
        vdivps xmm1,xmm13,xmm6
        vmovaps xmmword ptr [rcx+80],xmm1   ;save pixels 20 - 23
        vdivps xmm2,xmm14,xmm6
        vmovaps xmmword ptr [rcx+96],xmm2   ;save pixels 24 - 27
        vdivps xmm3,xmm15,xmm6
        vmovaps xmmword ptr [rcx+112],xmm3  ;save pixels 28 - 31

        add rdx,32                          ;update src ptr
        add rcx,128                         ;update des ptr
        sub r8d,1
        jnz @B                              ;repeat until done

        mov eax,1                           ;set success return code

Done:   _RestoreXmmRegs xmm6,xmm7,xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15
        _DeleteFrame
        ret
ConvertImgU8ToF32_ endp

; extern "C" bool ConvertImgF32ToU8_(uint8_t* des, const float* src, uint32_t num_pixels)

ConvertImgF32ToU8_ proc frame
        _CreateFrame F2U_,0,96
        _SaveXmmRegs xmm6,xmm7,xmm12,xmm13,xmm14,xmm15
        _EndProlog

; Make sure num_pixels is valid and pixel buffers are properly aligned
        xor eax,eax                         ;set error return code
        or r8d,r8d
        jz Done                             ;jump if num_pixels is zero
        cmp r8d,[c_NumPixelsMax]
        ja Done                             ;jump if num_pixels too big
        test r8d,1fh
        jnz Done                            ;jump if num_pixels % 32 != 0
        test rcx,0fh
        jnz Done                            ;jump if des not aligned
        test rdx,0fh
        jnz Done                            ;jump if src not aligned

; Load required packed constants into registers
        vmovaps xmm13,xmmword ptr [FloatToUint8Scale]   ;xmm13 = packed 255.0
        vmovaps xmm14,xmmword ptr [FloatToUint8Min]     ;xmm14 = packed 0.0
        vmovaps xmm15,xmmword ptr [FloatToUint8Max]     ;xmm15 = packed 1.0

        shr r8d,4                           ;number of pixel blocks
LP1:    mov r9d,4                           ;num pixel quartets per block

; Convert 16 float pixels to uint8_t
LP2:    vmovaps xmm0,xmmword ptr [rdx]      ;xmm0 = next pixel quartet
        vcmpps xmm1,xmm0,xmm14,CMP_LT       ;compare pixels to 0.0
        vandnps xmm2,xmm1,xmm0              ;clip pixels < 0.0 to 0.0

        vcmpps xmm3,xmm2,xmm15,CMP_GT       ;compare pixels to 1.0
        vandps xmm4,xmm3,xmm15              ;clip pixels > 1.0 to 1.0
        vandnps xmm5,xmm3,xmm2              ;xmm5 = pixels <= 1.0
        vorps xmm6,xmm5,xmm4                ;xmm6 = final clipped pixels
        vmulps xmm7,xmm6,xmm13              ;xmm7 = FP pixels [0.0, 255.0]

        vcvtps2dq xmm0,xmm7                 ;xmm0 = dword pixels [0, 255]
        vpackusdw xmm1,xmm0,xmm0            ;xmm1[63:0] = word pixels
        vpackuswb xmm2,xmm1,xmm1            ;xmm2[31:0] = bytes pixels

; Save the current byte pixel quartet
        vpextrd eax,xmm2,0                  ;eax = new pixel quartet
        vpsrldq xmm12,xmm12,4               ;adjust xmm12 for new quartet
        vpinsrd xmm12,xmm12,eax,3           ;xmm12[127:96] = new quartet

        add rdx,16                          ;update src ptr
        sub r9d,1
        jnz LP2                             ;repeat until done

; Save the current byte pixel block (16 pixels)
        vmovdqa xmmword ptr [rcx],xmm12     ;save current pixel block
        add rcx,16                          ;update des ptr
        sub r8d,1
        jnz LP1                             ;repeat until done

        mov eax,1                           ;set success return code

Done:   _RestoreXmmRegs xmm6,xmm7,xmm12,xmm13,xmm14,xmm15
        _DeleteFrame
        ret
ConvertImgF32ToU8_ endp
        end


The C++ code in Listing 7-6 is straightforward. The function ConvertImgU8ToF32Cpp contains
code that converts pixel values from uint8_t [0, 255] to single-precision floating-point [0.0, 1.0].
This function contains a simple for loop that calculates des[i] = src[i] / 255.0f. The counterpart
function ConvertImgF32ToU8Cpp performs the inverse operation. Note that this function clips any pixel
values greater than 1.0 or less than 0.0 before performing the floating-point to uint8_t conversion.

The functions ConvertImgU8ToF32 and ConvertImgF32ToU8 contain code that initializes test arrays and
exercises the C++ and assembly language conversion routines. Note that the latter function initializes
the first few entries of the source buffer to known values in order to demonstrate the aforementioned
clipping operation.

The processing loop of the assembly language function ConvertImgU8ToF32_ converts 32 pixels from
uint8_t (or byte) to single-precision floating-point during each iteration. The conversion technique begins
with the size promotion of packed pixels from unsigned byte to unsigned doubleword integers using a series
of vpunpck[h|l]bw and vpunpck[h|l]wd instructions. The doubleword values are then converted to
single-precision floating-point values using the instruction vcvtdq2ps (Convert Packed Doubleword Integers to
Packed Single-Precision Floating-Point Values). The resultant packed floating-point values are normalized
to [0.0, 1.0] by dividing them by 255.0 and saved to the destination buffer.
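The promotion-and-convert sequence can also be written with C++ compiler intrinsics. The sketch below is illustrative only: it assumes an x86 compiler with the SSE2 header emmintrin.h, handles 16 pixels rather than 32, uses unaligned loads and stores for simplicity, and the function name ConvertU8ToF32x16 is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>   // SSE2 intrinsics

// Converts 16 uint8_t pixels [0, 255] to 16 floats [0.0, 1.0].
inline void ConvertU8ToF32x16(float* des, const uint8_t* src)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128 scale = _mm_set1_ps(255.0f);
    __m128i bytes = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));

    // Bytes -> words (zero-extended)
    __m128i w_lo = _mm_unpacklo_epi8(bytes, zero);
    __m128i w_hi = _mm_unpackhi_epi8(bytes, zero);

    // Words -> doublewords (zero-extended), four pixels per register
    __m128i d[4] = {
        _mm_unpacklo_epi16(w_lo, zero), _mm_unpackhi_epi16(w_lo, zero),
        _mm_unpacklo_epi16(w_hi, zero), _mm_unpackhi_epi16(w_hi, zero)
    };

    // Doublewords -> floats, then normalize by dividing by 255.0
    for (int i = 0; i < 4; i++)
    {
        __m128 f = _mm_div_ps(_mm_cvtepi32_ps(d[i]), scale);
        _mm_storeu_ps(des + 4 * i, f);
    }
}
```

The element order is preserved: d[0] holds pixels 0 through 3, d[1] holds pixels 4 through 7, and so on, which matches the order in which the assembly language code stores its results.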

The assembly language function ConvertImgF32ToU8_ performs packed single-precision floating-point
to packed unsigned byte conversions. The inner loop (starting at the label LP2) of this conversion
function uses the instructions vcmpps xmm1,xmm0,xmm14,CMP_LT, vcmpps xmm3,xmm2,xmm15,CMP_GT,
and some Boolean logic to clip any pixel values less than 0.0 or greater than 1.0. Figure 7-5 illustrates
this technique in greater detail. The vcvtps2dq xmm0,xmm7 instruction converts the four single-precision
floating-point values in XMM7 to doubleword integers and saves the results in register
XMM0. The next two instructions, vpackusdw xmm1,xmm0,xmm0 and vpackuswb xmm2,xmm1,xmm1,
size-reduce the packed doubleword integers to packed unsigned bytes. Following the execution of
the vpackuswb instruction, register XMM2[31:0] contains four packed unsigned byte values. This pixel
quartet is then copied to XMM12[127:96] using the instruction sequence vpextrd eax,xmm2,0, vpsrldq
xmm12,xmm12,4, and vpinsrd xmm12,xmm12,eax,3. The vpinsrd (Insert Dword) instruction that's used
here copies the doubleword value in register EAX to doubleword element position 3 in register XMM12
(or XMM12[127:96]).


[Figure: packed constants in XMM13, XMM14, and XMM15, followed by the contents of registers XMM0
through XMM7 after each step of the clipping sequence, including vandnps xmm2,xmm1,xmm0]


Figure 7-5. Illustration of floating-point clipping technique used in function ConvertImgF32ToU8_
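The compare-and-mask clipping technique can be sketched using C++ compiler intrinsics. This is not the book's code: the function name ClipAndConvertF32ToU8x4 is hypothetical, unaligned loads are used for simplicity, and _mm_packs_epi32 (an SSE2 signed-saturation pack) is substituted for vpackusdw so that the example needs nothing beyond SSE2. The substitution is safe here because the doubleword values are already in [0, 255].

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>   // SSE2 intrinsics

// Clips four floats to [0.0, 1.0], scales by 255.0, and packs them to four bytes.
inline void ClipAndConvertF32ToU8x4(uint8_t* des, const float* src)
{
    const __m128 v_min = _mm_setzero_ps();
    const __m128 v_max = _mm_set1_ps(1.0f);
    const __m128 v_scale = _mm_set1_ps(255.0f);

    __m128 x = _mm_loadu_ps(src);
    __m128 lt = _mm_cmplt_ps(x, v_min);       // all ones where x < 0.0
    __m128 c0 = _mm_andnot_ps(lt, x);         // (~lt) & x: values < 0.0 become 0.0

    __m128 gt = _mm_cmpgt_ps(c0, v_max);      // all ones where value > 1.0
    __m128 hi = _mm_and_ps(gt, v_max);        // 1.0 where value > 1.0, else 0.0
    __m128 keep = _mm_andnot_ps(gt, c0);      // values <= 1.0, else 0.0
    __m128 clipped = _mm_or_ps(keep, hi);     // final clipped values

    // Convert to doublewords using the current rounding mode, then size-reduce
    __m128i d = _mm_cvtps_epi32(_mm_mul_ps(clipped, v_scale));
    __m128i w = _mm_packs_epi32(d, d);        // dwords -> words
    __m128i b = _mm_packus_epi16(w, w);       // words -> unsigned bytes

    uint32_t quartet = (uint32_t)_mm_cvtsi128_si32(b);
    std::memcpy(des, &quartet, sizeof(quartet));
}
```

A compare instruction produces an all-ones or all-zeros mask for each element, so the AND, AND-NOT, and OR operations select between the original value and the clipping limit without any branches.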


The inner loop conversion process that's described in the previous paragraph executes for four
iterations. Following the completion of the inner loop, XMM12 contains 16 unsigned byte pixel values. This
pixel block is then saved to the destination buffer using a vmovdqa xmmword ptr [rcx],xmm12 instruction.
The outer loop repeats until all pixels have been converted. Here is the output for source code example
Ch07_06:

Results for ConvertImgU8ToF32
  num_pixels = 1024
  num_diff = 0


Results for ConvertImgF32ToU8
  num_pixels = 1024
  num_diff = 0
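The way the inner loop of ConvertImgF32ToU8_ assembles four pixel quartets in XMM12 can be modeled in scalar C++. The following is a hypothetical sketch: the Xmm structure and the PushQuartet function are illustrative names that model the combined effect of vpsrldq xmm12,xmm12,4 and vpinsrd xmm12,xmm12,eax,3 on a register viewed as four doublewords.

```cpp
#include <cassert>
#include <cstdint>

// Models a 128-bit register as four doublewords; d[0] corresponds to bits 31:0.
struct Xmm { uint32_t d[4]; };

// Shifts the register right by four bytes, then writes the new quartet
// to doubleword element 3 (bits 127:96).
inline void PushQuartet(Xmm& r, uint32_t quartet)
{
    r.d[0] = r.d[1];
    r.d[1] = r.d[2];
    r.d[2] = r.d[3];
    r.d[3] = quartet;
}
```

After four calls, the first quartet has moved down to element 0 and the last one occupies element 3, so the 16 bytes are in source order. Whatever the register held beforehand has been shifted out, which suggests why the assembly language code does not need to initialize XMM12 before the inner loop.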


Image Histograms 


Many image-processing algorithms require a histogram of an image's pixel intensity values. Figure 7-6 shows
a sample grayscale image and its histogram. The next source code example, Ch07_07, illustrates how to build
a histogram of pixel intensity values for an image containing 8-bit grayscale pixel values. This example also
explains how to use the stack in an assembly language function to store intermediate results. Listing 7-7
shows the source code for example Ch07_07.


Figure 7-6. Sample grayscale image and its histogram


Listing 7-7. Example Ch07_07


#pragma once
#include <cstdint>

// Ch07_07.cpp
extern bool AvxBuildImageHistogramCpp(uint32_t* histo, const uint8_t* pixel_buff, uint32_t num_pixels);

// Ch07_07_.asm
extern "C" bool AvxBuildImageHistogram_(uint32_t* histo, const uint8_t* pixel_buff, uint32_t num_pixels);

// Ch07_07_BM.cpp
extern void AvxBuildImageHistogram_BM(void);


#include "stdafx.h"
#include <cstdint>
#include <cstring>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
#include "Ch07_07.h"
#include "AlignedMem.h"
#include "ImageMatrix.h"

using namespace std;

extern "C" uint32_t c_NumPixelsMax = 16777216;

bool AvxBuildImageHistogramCpp(uint32_t* histo, const uint8_t* pixel_buff, uint32_t num_pixels)
{
    // Make sure num_pixels is valid
    if ((num_pixels == 0) || (num_pixels > c_NumPixelsMax))
        return false;
    if (num_pixels % 32 != 0)
        return false;

    // Make sure histo is aligned to a 16-byte boundary
    if (!AlignedMem::IsAligned(histo, 16))
        return false;

    // Make sure pixel_buff is aligned to a 16-byte boundary
    if (!AlignedMem::IsAligned(pixel_buff, 16))
        return false;

    // Build the histogram
    memset(histo, 0, 256 * sizeof(uint32_t));

    for (uint32_t i = 0; i < num_pixels; i++)
        histo[pixel_buff[i]]++;

    return true;
}

void AvxBuildImageHistogram(void)
{
    const wchar_t* image_fn = L"..\\Ch07_Data\\TestImage1.bmp";
    const wchar_t* csv_fn = L"Ch07_07_AvxBuildImageHistogram_Histograms.csv";

    ImageMatrix im(image_fn);
    uint32_t num_pixels = im.GetNumPixels();
    uint8_t* pixel_buff = im.GetPixelBuffer<uint8_t>();
    AlignedArray<uint32_t> histo1_aa(256, 16);
    AlignedArray<uint32_t> histo2_aa(256, 16);

    bool rc1 = AvxBuildImageHistogramCpp(histo1_aa.Data(), pixel_buff, num_pixels);
    bool rc2 = AvxBuildImageHistogram_(histo2_aa.Data(), pixel_buff, num_pixels);

    cout << "\nResults for AvxBuildImageHistogram\n";

    if (!rc1 || !rc2)
    {
        cout << "Bad return code: ";
        cout << "rc1 = " << rc1 << ", rc2 = " << rc2 << '\n';
        return;
    }

    ofstream ofs(csv_fn);

    if (ofs.bad())
        cout << "File create error - " << csv_fn << '\n';
    else
    {
        bool compare_error = false;
        uint32_t* histo1 = histo1_aa.Data();
        uint32_t* histo2 = histo2_aa.Data();
        const char* delim = ", ";

        for (uint32_t i = 0; i < 256; i++)
        {
            ofs << i << delim;
            ofs << histo1[i] << delim << histo2[i] << '\n';

            if (histo1[i] != histo2[i])
            {
                compare_error = true;
                cout << "  Histogram compare error at index " << i << '\n';
                cout << "  counts: " << histo1[i] << delim << histo2[i] << '\n';
            }
        }

        if (!compare_error)
            cout << "  Histograms are identical\n";

        ofs.close();
    }
}

int main()
{
    try
    {
        AvxBuildImageHistogram();
        AvxBuildImageHistogram_BM();
    }
    catch (...)
    {
        cout << "Unexpected exception has occurred\n";
        cout << "File = " << __FILE__ << '\n';
    }

    return 0;
}

;-------------------------------------------------
;               Ch07_07.asm
;-------------------------------------------------

        include <MacrosX86-64-AVX.asmh>

; extern bool AvxBuildImageHistogram_(uint32_t* histo, const uint8_t* pixel_buff, uint32_t num_pixels)
;
; Returns:      0 = invalid argument value, 1 = success

        .code
        extern c_NumPixelsMax:dword

AvxBuildImageHistogram_ proc frame
        _CreateFrame BIH_,1024,0,rbx,rsi,rdi
        _EndProlog

; Make sure num_pixels is valid
        xor eax,eax                         ;set error code
        test r8d,r8d
        jz Done                             ;jump if num_pixels is zero
        cmp r8d,[c_NumPixelsMax]
        ja Done                             ;jump if num_pixels too big
        test r8d,1fh
        jnz Done                            ;jump if num_pixels % 32 != 0

; Make sure histo & pixel_buff are properly aligned
        mov rsi,rcx                         ;rsi = ptr to histo
        test rsi,0fh
        jnz Done                            ;jump if histo misaligned
        mov r9,rdx
        test r9,0fh
        jnz Done                            ;jump if pixel_buff misaligned

; Initialize local histogram buffers (set all entries to zero)
        xor eax,eax
        mov rdi,rsi                         ;rdi = ptr to histo
        mov rcx,128                         ;rcx = size in qwords
        rep stosq                           ;zero histo
        mov rdi,rbp                         ;rdi = ptr to histo2
        mov rcx,128                         ;rcx = size in qwords
        rep stosq                           ;zero histo2

; Perform processing loop initializations
        shr r8d,5                           ;number of pixel blocks (32 pixels/block)
        mov rdi,rbp                         ;ptr to histo2

; Build the histograms
        align 16                            ;align jump target
@@:     vmovdqa xmm0,xmmword ptr [r9]       ;load pixel block
        vmovdqa xmm1,xmmword ptr [r9+16]    ;load pixel block

; Process pixels 0 - 3
        vpextrb rax,xmm0,0
        add dword ptr [rsi+rax*4],1         ;count pixel 0
        vpextrb rbx,xmm0,1
        add dword ptr [rdi+rbx*4],1         ;count pixel 1
        vpextrb rcx,xmm0,2
        add dword ptr [rsi+rcx*4],1         ;count pixel 2
        vpextrb rdx,xmm0,3
        add dword ptr [rdi+rdx*4],1         ;count pixel 3

; Process pixels 4 - 7
        vpextrb rax,xmm0,4
        add dword ptr [rsi+rax*4],1         ;count pixel 4
        vpextrb rbx,xmm0,5
        add dword ptr [rdi+rbx*4],1         ;count pixel 5
        vpextrb rcx,xmm0,6
        add dword ptr [rsi+rcx*4],1         ;count pixel 6
        vpextrb rdx,xmm0,7
        add dword ptr [rdi+rdx*4],1         ;count pixel 7

; Process pixels 8 - 11
        vpextrb rax,xmm0,8
        add dword ptr [rsi+rax*4],1         ;count pixel 8
        vpextrb rbx,xmm0,9
        add dword ptr [rdi+rbx*4],1         ;count pixel 9
        vpextrb rcx,xmm0,10
        add dword ptr [rsi+rcx*4],1         ;count pixel 10
        vpextrb rdx,xmm0,11
        add dword ptr [rdi+rdx*4],1         ;count pixel 11

; Process pixels 12 - 15
        vpextrb rax,xmm0,12
        add dword ptr [rsi+rax*4],1         ;count pixel 12
        vpextrb rbx,xmm0,13
        add dword ptr [rdi+rbx*4],1         ;count pixel 13
        vpextrb rcx,xmm0,14
        add dword ptr [rsi+rcx*4],1         ;count pixel 14
        vpextrb rdx,xmm0,15
        add dword ptr [rdi+rdx*4],1         ;count pixel 15

; Process pixels 16 - 19
        vpextrb rax,xmm1,0
        add dword ptr [rsi+rax*4],1         ;count pixel 16
        vpextrb rbx,xmm1,1
        add dword ptr [rdi+rbx*4],1         ;count pixel 17
        vpextrb rcx,xmm1,2
        add dword ptr [rsi+rcx*4],1         ;count pixel 18
        vpextrb rdx,xmm1,3
        add dword ptr [rdi+rdx*4],1         ;count pixel 19

; Process pixels 20 - 23
        vpextrb rax,xmm1,4
        add dword ptr [rsi+rax*4],1         ;count pixel 20
        vpextrb rbx,xmm1,5
        add dword ptr [rdi+rbx*4],1         ;count pixel 21
        vpextrb rcx,xmm1,6
        add dword ptr [rsi+rcx*4],1         ;count pixel 22
        vpextrb rdx,xmm1,7
        add dword ptr [rdi+rdx*4],1         ;count pixel 23

; Process pixels 24 - 27
        vpextrb rax,xmm1,8
        add dword ptr [rsi+rax*4],1         ;count pixel 24
        vpextrb rbx,xmm1,9
        add dword ptr [rdi+rbx*4],1         ;count pixel 25
        vpextrb rcx,xmm1,10
        add dword ptr [rsi+rcx*4],1         ;count pixel 26
        vpextrb rdx,xmm1,11
        add dword ptr [rdi+rdx*4],1         ;count pixel 27

; Process pixels 28 - 31
        vpextrb rax,xmm1,12
        add dword ptr [rsi+rax*4],1         ;count pixel 28
        vpextrb rbx,xmm1,13
        add dword ptr [rdi+rbx*4],1         ;count pixel 29
        vpextrb rcx,xmm1,14
        add dword ptr [rsi+rcx*4],1         ;count pixel 30
        vpextrb rdx,xmm1,15
        add dword ptr [rdi+rdx*4],1         ;count pixel 31

        add r9,32                           ;r9 = next pixel block
        sub r8d,1
        jnz @B                              ;repeat loop if not done

; Merge intermediate histograms into final histogram
        mov ecx,32                          ;ecx = num iterations
        xor eax,eax                         ;rax = common offset

@@:     vmovdqa xmm0,xmmword ptr [rsi+rax]          ;load histo counts
        vmovdqa xmm1,xmmword ptr [rsi+rax+16]
        vpaddd xmm0,xmm0,xmmword ptr [rdi+rax]      ;add counts from histo2
        vpaddd xmm1,xmm1,xmmword ptr [rdi+rax+16]
        vmovdqa xmmword ptr [rsi+rax],xmm0          ;save final result
        vmovdqa xmmword ptr [rsi+rax+16],xmm1

        add rax,32
        sub ecx,1
        jnz @B

        mov eax,1                           ;set success return code

Done:   _DeleteFrame rbx,rsi,rdi
        ret
AvxBuildImageHistogram_ endp
        end


Near the top of the C++ code is a function named AvxBuildImageHistogramCpp. This function
constructs an image histogram using a rudimentary technique. Prior to the histogram's actual construction,
the number of image pixels is validated for size (greater than 0 and not greater than c_NumPixelsMax) and
divisibility by 32. The divisibility test is performed to ensure compatibility with the assembly language
function AvxBuildImageHistogram_. Next, the addresses of histo and pixel_buff are verified for proper
alignment. The call to memset initializes each histogram pixel count bin to zero. A simple for loop is then
used to construct the histogram.

The function AvxBuildImageHistogram uses a C++ class named ImageMatrix to load the pixels of an
image into memory. (The source code for ImageMatrix is not shown but is included as part of the chapter
download package.) The variables num_pixels and pixel_buff are then initialized using the member
functions ImageMatrix::GetNumPixels and ImageMatrix::GetPixelBuffer. Two histogram buffers are then
allocated using the C++ template class AlignedArray<uint32_t>. Following the construction of the
histograms using the functions AvxBuildImageHistogramCpp and AvxBuildImageHistogram_, the pixel
counts in the two histogram buffers are compared for equivalence and written to a comma-separated-value
text file.

The assembly language function AvxBuildImageHistogram_ constructs an image histogram using the
AVX instruction set. In order to improve performance, this function builds two intermediate histograms
and merges them into a final histogram. AvxBuildImageHistogram_ begins by creating a stack frame
using the _CreateFrame macro. Note that the stack frame created by _CreateFrame includes 1024 bytes
(256 doublewords, one for each grayscale intensity level) of local storage space, which is used for one
of the intermediate histogram buffers. Following the execution of the code generated by _CreateFrame,
register RBP points to the intermediate histogram on the stack (see Figure 5-6). The caller-provided buffer
histo is used as the second intermediate histogram buffer. Following the _EndProlog macro, the function
AvxBuildImageHistogram_ validates num_pixels for size and divisibility by 32; it then checks the addresses
of histo and pixel_buff for proper alignment. The count values in both intermediate histograms are then
initialized to zero using the rep stosq instruction.

The main processing loop begins with two vmovdqa instructions that load 32 image pixels into registers
XMM1:XMM0. Note that prior to the first vmovdqa instruction, the MASM directive align 16 is used to align
this instruction on a 16-byte boundary. Aligning the target of a jump instruction on a 16-byte boundary
is an optimization technique that often improves performance. Chapter 15 discusses this and other
optimization techniques in greater detail. Next, a vpextrb rax,xmm0,0 instruction extracts pixel element 0
(i.e., XMM0[7:0]) from register XMM0 and copies it to the low-order bits of register RAX; the high-order bits
of RAX are set to zero. The ensuing add dword ptr [rsi+rax*4],1 instruction updates the appropriate pixel
count bin in the first intermediate histogram. The next two instructions, vpextrb rbx,xmm0,1 and add dword
ptr [rdi+rbx*4],1, process pixel element 1 in the same manner using the second intermediate histogram.
This pixel-processing technique is then repeated for the remaining pixels in the current block.
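
The two-histogram idea can be sketched in portable C++. The function name BuildHistogramTwoBins and
its parameters are hypothetical and are not part of the book's source code. The sketch also assumes
that pixels alternate between the two intermediate histograms and that num_pixels is even; it only
mirrors the structure described above and makes no attempt to reproduce the AVX instruction sequence.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the book): builds a 256-bin histogram using
// two intermediate histograms. Assumes num_pixels is even.
std::array<uint32_t, 256> BuildHistogramTwoBins(const uint8_t* pixels, size_t num_pixels)
{
    std::array<uint32_t, 256> histo0{};   // first intermediate histogram
    std::array<uint32_t, 256> histo1{};   // second intermediate histogram

    for (size_t i = 0; i < num_pixels; i += 2)
    {
        histo0[pixels[i]] += 1;           // even-numbered pixel
        histo1[pixels[i + 1]] += 1;       // odd-numbered pixel
    }

    // Merge the intermediate histograms into the final histogram
    std::array<uint32_t, 256> histo{};
    for (size_t k = 0; k < 256; k++)
        histo[k] = histo0[k] + histo1[k];

    return histo;
}
```

With two buffers, adjacent pixels that share the same intensity do not update the same memory location
back to back. This is one plausible reason the technique helps; the measured benefit depends on the
processor and the image.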

Following the execution of the processing loop, the pixel count values in the two intermediate
histograms are summed using packed integer arithmetic to create the final histogram. The _DeleteFrame
macro is then used to release the local stack frame and restore the previously-saved non-volatile
general-purpose registers. Here is the output for source code example Ch07_07:


Results for AvxBuildImageHistogram
Histograms are identical

Running benchmark function AvxBuildImageHistogram_BM - please wait
Benchmark times save to file Ch07_07_AvxBuildImageHistogram_BM_CHROMIUM.csv


Table 7-3 shows benchmark timing measurements for the histogram build functions. 


Table 7-3. Histogram Build Mean Execution Times (Microseconds) Using TestImage1.bmp

CPU         AvxBuildImageHistogramCpp   AvxBuildImageHistogram_
i7-4790S    277                         230
i9-7900X    255                         199
i7-8700K    241                         191


Image Thresholding 


Image thresholding is an image-processing technique that creates a binary image (i.e., an image with only
two colors) from a grayscale image. This binary (or mask) image signifies which pixels in the original image
are greater than a predetermined or algorithmically derived intensity threshold value. Figure 7-7 illustrates
a thresholding operation. Mask images are often employed to perform additional calculations using the
grayscale pixel values of the original image. For example, one typical use of the mask image that's shown
in Figure 7-7 is to compute the mean intensity value of all above-threshold pixels in the original image. The
application of a mask image simplifies calculating the mean since it facilitates the use of simple Boolean
expressions to exclude unwanted pixels from the computations.


Figure 7-7. Sample grayscale and mask images (panels: Original Grayscale Image, Mask Image After
Thresholding)


Source code example Ch07_08 demonstrates how to calculate the mean intensity of image
pixels above a specified threshold. It also shows how to call a C++ function from an assembly language
function. Listing 7-8 shows the source code for example Ch07_08.


Listing 7-8. Example Ch07_08


#pragma once
#include <cstdint>

// Image threshold data structure. This structure must agree with the
// structure that's defined in Ch07_08.asm

struct ITD
{
    uint8_t* m_PbSrc;               // Source image pixel buffer
    uint8_t* m_PbMask;              // Mask image pixel buffer
    uint32_t m_NumPixels;           // Number of source image pixels
    uint32_t m_NumMaskedPixels;     // Number of masked pixels
    uint32_t m_SumMaskedPixels;     // Sum of masked pixels
    uint8_t m_Threshold;            // Image threshold value
    uint8_t m_Pad[3];               // Available for future use
    double m_MeanMaskedPixels;      // Mean of masked pixels
};

// Functions defined in Ch07_08.cpp
extern bool AvxThresholdImageCpp(ITD* itd);
extern bool AvxCalcImageMeanCpp(ITD* itd);
extern "C" bool IsValid(uint32_t num_pixels, const uint8_t* pb_src, const uint8_t* pb_mask);

// Functions defined in Ch07_08.asm
extern "C" bool AvxThresholdImage_(ITD* itd);
extern "C" bool AvxCalcImageMean_(ITD* itd);

// Functions defined in Ch07_08_BM.cpp
extern void AvxThreshold_BM(void);

// Miscellaneous constants
const uint8_t c_TestThreshold = 96;


#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <iomanip>
#include "Ch07_08.h"
#include "AlignedMem.h"
#include "ImageMatrix.h"

using namespace std;
extern "C" uint32_t c_NumPixelsMax = 16777216;


bool IsValid(uint32_t num_pixels, const uint8_t* pb_src, const uint8_t* pb_mask)
{
    const size_t alignment = 16;

    // Make sure num_pixels is valid
    if ((num_pixels == 0) || (num_pixels > c_NumPixelsMax))
        return false;
    if ((num_pixels % 64) != 0)
        return false;

    // Make sure image buffers are properly aligned
    if (!AlignedMem::IsAligned(pb_src, alignment))
        return false;
    if (!AlignedMem::IsAligned(pb_mask, alignment))
        return false;

    return true;
}


bool AvxThresholdImageCpp(ITD* itd)
{
    uint8_t* pb_src = itd->m_PbSrc;
    uint8_t* pb_mask = itd->m_PbMask;
    uint8_t threshold = itd->m_Threshold;
    uint32_t num_pixels = itd->m_NumPixels;

    // Verify pixel count and buffer alignment
    if (!IsValid(num_pixels, pb_src, pb_mask))
        return false;

    // Threshold the image
    for (uint32_t i = 0; i < num_pixels; i++)
        *pb_mask++ = (*pb_src++ > threshold) ? 0xff : 0x00;

    return true;
}

bool AvxCalcImageMeanCpp(ITD* itd)
{
    uint8_t* pb_src = itd->m_PbSrc;
    uint8_t* pb_mask = itd->m_PbMask;
    uint32_t num_pixels = itd->m_NumPixels;

    // Verify pixel count and buffer alignment
    if (!IsValid(num_pixels, pb_src, pb_mask))
        return false;

    // Calculate mean of masked pixels
    uint32_t sum_masked_pixels = 0;
    uint32_t num_masked_pixels = 0;

    for (uint32_t i = 0; i < num_pixels; i++)
    {
        uint8_t mask_val = *pb_mask++;
        num_masked_pixels += mask_val & 1;
        sum_masked_pixels += (*pb_src++ & mask_val);
    }

    itd->m_NumMaskedPixels = num_masked_pixels;
    itd->m_SumMaskedPixels = sum_masked_pixels;

    if (num_masked_pixels > 0)
        itd->m_MeanMaskedPixels = (double)sum_masked_pixels / num_masked_pixels;
    else
        itd->m_MeanMaskedPixels = -1.0;

    return true;
}

void AvxThreshold(void)
{
    const wchar_t* fn_src = L"..\\Ch07_Data\\TestImage2.bmp";
    const wchar_t* fn_mask1 = L"Ch07_08_AvxThreshold_TestImage2_Mask1.bmp";
    const wchar_t* fn_mask2 = L"Ch07_08_AvxThreshold_TestImage2_Mask2.bmp";

    ImageMatrix im_src(fn_src);
    int im_h = im_src.GetHeight();
    int im_w = im_src.GetWidth();
    ImageMatrix im_mask1(im_h, im_w, PixelType::Gray8);
    ImageMatrix im_mask2(im_h, im_w, PixelType::Gray8);
    ITD itd1, itd2;

    itd1.m_PbSrc = im_src.GetPixelBuffer<uint8_t>();
    itd1.m_PbMask = im_mask1.GetPixelBuffer<uint8_t>();
    itd1.m_NumPixels = im_src.GetNumPixels();
    itd1.m_Threshold = c_TestThreshold;

    itd2.m_PbSrc = im_src.GetPixelBuffer<uint8_t>();
    itd2.m_PbMask = im_mask2.GetPixelBuffer<uint8_t>();
    itd2.m_NumPixels = im_src.GetNumPixels();
    itd2.m_Threshold = c_TestThreshold;

    // Threshold image
    bool rc1 = AvxThresholdImageCpp(&itd1);
    bool rc2 = AvxThresholdImage_(&itd2);

    if (!rc1 || !rc2)
    {
        cout << "\nInvalid return code: ";
        cout << "rc1 = " << rc1 << ", rc2 = " << rc2 << '\n';
        return;
    }

    im_mask1.SaveToBitmapFile(fn_mask1);
    im_mask2.SaveToBitmapFile(fn_mask2);

    // Calculate mean of masked pixels
    rc1 = AvxCalcImageMeanCpp(&itd1);
    rc2 = AvxCalcImageMean_(&itd2);

    if (!rc1 || !rc2)
    {
        cout << "\nInvalid return code: ";
        cout << "rc1 = " << rc1 << ", rc2 = " << rc2 << '\n';
        return;
    }

    // Print results
    const int w = 12;
    cout << fixed << setprecision(4);

    cout << "\nResults for AvxThreshold\n\n";
    cout << "                           C++       X86-AVX\n";
    cout << "--------------------------------------------\n";
    cout << "SumPixelsMasked:  ";
    cout << setw(w) << itd1.m_SumMaskedPixels << "  ";
    cout << setw(w) << itd2.m_SumMaskedPixels << '\n';
    cout << "NumPixelsMasked:  ";
    cout << setw(w) << itd1.m_NumMaskedPixels << "  ";
    cout << setw(w) << itd2.m_NumMaskedPixels << '\n';
    cout << "MeanMaskedPixels: ";
    cout << setw(w) << itd1.m_MeanMaskedPixels << "  ";
    cout << setw(w) << itd2.m_MeanMaskedPixels << '\n';
}

int main()
{
    try
    {
        AvxThreshold();
        AvxThreshold_BM();
    }
    catch (...)
    {
        cout << "Unexpected exception has occurred\n";
    }

    return 0;
}

;-------------------------------------------------
;               Ch07_08.asm
;-------------------------------------------------

        include <MacrosX86-64-AVX.asmh>

; Image threshold data structure (see Ch07_08.h)

ITD                 struct
PbSrc               qword ?
PbMask              qword ?
NumPixels           dword ?
NumMaskedPixels     dword ?
SumMaskedPixels     dword ?
Threshold           byte ?
Pad                 byte 3 dup(?)
MeanMaskedPixels    real8 ?
ITD                 ends

                .const
                align 16
PixelScale      byte 16 dup(80h)            ;uint8 to int8 scale value
CountPixelsMask byte 16 dup(01h)            ;mask to count pixels
R8_MinusOne     real8 -1.0                  ;invalid mean value

                .code
                extern IsValid:proc

; extern "C" bool AvxThresholdImage_(ITD* itd);
;
; Returns:      0 = invalid size or unaligned image buffer, 1 = success


AvxThresholdImage_ proc frame
        _CreateFrame TI_,0,0,rbx
        _EndProlog

; Verify the arguments in the ITD structure
        mov rbx,rcx                             ;copy itd ptr to non-volatile register
        mov ecx,[rbx+ITD.NumPixels]             ;ecx = num_pixels
        mov rdx,[rbx+ITD.PbSrc]                 ;rdx = pb_src
        mov r8,[rbx+ITD.PbMask]                 ;r8 = pb_mask
        sub rsp,32                              ;allocate home area for IsValid
        call IsValid                            ;validate args
        or al,al
        jz Done                                 ;jump if invalid

; Initialize registers for processing loop
        mov ecx,[rbx+ITD.NumPixels]             ;ecx = num_pixels
        shr ecx,6                               ;ecx = number of 64b pixel blocks
        mov rdx,[rbx+ITD.PbSrc]                 ;rdx = pb_src
        mov r8,[rbx+ITD.PbMask]                 ;r8 = pb_mask

        movzx r9d,byte ptr [rbx+ITD.Threshold]  ;r9d = threshold
        vmovd xmm1,r9d                          ;xmm1[7:0] = threshold
        vpxor xmm0,xmm0,xmm0                    ;mask for vpshufb
        vpshufb xmm1,xmm1,xmm0                  ;xmm1 = packed threshold

        vmovdqa xmm4,xmmword ptr [PixelScale]   ;packed pixel scale factor
        vpsubb xmm5,xmm1,xmm4                   ;scaled threshold

; Create the mask image
@@:     vmovdqa xmm0,xmmword ptr [rdx]          ;original image pixels
        vpsubb xmm1,xmm0,xmm4                   ;scaled image pixels
        vpcmpgtb xmm2,xmm1,xmm5                 ;mask pixels
        vmovdqa xmmword ptr [r8],xmm2           ;save mask result

        vmovdqa xmm0,xmmword ptr [rdx+16]
        vpsubb xmm1,xmm0,xmm4
        vpcmpgtb xmm2,xmm1,xmm5
        vmovdqa xmmword ptr [r8+16],xmm2

        vmovdqa xmm0,xmmword ptr [rdx+32]
        vpsubb xmm1,xmm0,xmm4
        vpcmpgtb xmm2,xmm1,xmm5
        vmovdqa xmmword ptr [r8+32],xmm2

        vmovdqa xmm0,xmmword ptr [rdx+48]
        vpsubb xmm1,xmm0,xmm4
        vpcmpgtb xmm2,xmm1,xmm5
        vmovdqa xmmword ptr [r8+48],xmm2

        add rdx,64
        add r8,64                               ;update pointers
        sub ecx,1                               ;update counter
        jnz @B                                  ;repeat until done

        mov eax,1                               ;set success return code

Done:   _DeleteFrame rbx
        ret
AvxThresholdImage_ endp


;
; Macro _UpdateBlockSums
;

_UpdateBlockSums macro disp
        vmovdqa xmm0,xmmword ptr [rdx+disp]     ;xmm0 = 16 image pixels
        vmovdqa xmm1,xmmword ptr [r8+disp]      ;xmm1 = 16 mask pixels

        vpand xmm2,xmm1,xmm8                    ;xmm2 = 16 mask pixels (0x00 or 0x01)
        vpaddb xmm6,xmm6,xmm2                   ;update block num_masked_pixels

        vpand xmm2,xmm0,xmm1                    ;zero out unmasked image pixel
        vpunpcklbw xmm3,xmm2,xmm9               ;promote image pixels from byte to word
        vpunpckhbw xmm4,xmm2,xmm9
        vpaddw xmm4,xmm4,xmm3
        vpaddw xmm7,xmm7,xmm4                   ;update block sum_mask_pixels
        endm


; extern "C" bool AvxCalcImageMean_(ITD* itd);
;
; Returns:      0 = invalid image size or unaligned image buffer, 1 = success

AvxCalcImageMean_ proc frame
        _CreateFrame CIM_,0,64,rbx
        _SaveXmmRegs xmm6,xmm7,xmm8,xmm9
        _EndProlog

; Verify the arguments in the ITD structure
        mov rbx,rcx                             ;rbx = itd ptr
        mov ecx,[rbx+ITD.NumPixels]             ;ecx = num_pixels
        mov rdx,[rbx+ITD.PbSrc]                 ;rdx = pb_src
        mov r8,[rbx+ITD.PbMask]                 ;r8 = pb_mask
        sub rsp,32                              ;allocate home area for IsValid
        call IsValid                            ;validate args
        or al,al
        jz Done                                 ;jump if invalid

; Initialize registers for processing loop
        mov ecx,[rbx+ITD.NumPixels]             ;ecx = num_pixels
        shr ecx,6                               ;ecx = number of 64b pixel blocks
        mov rdx,[rbx+ITD.PbSrc]                 ;rdx = pb_src
        mov r8,[rbx+ITD.PbMask]                 ;r8 = pb_mask

        vmovdqa xmm8,xmmword ptr [CountPixelsMask]  ;mask for counting pixels
        vpxor xmm9,xmm9,xmm9                    ;xmm9 = packed zero

        xor r10d,r10d                           ;r10d = num_masked_pixels (1 dword)
        vpxor xmm5,xmm5,xmm5                    ;sum_masked_pixels (4 dwords)

; Calculate num_mask_pixels and sum_mask_pixels
LP1:    vpxor xmm6,xmm6,xmm6                    ;num_masked_pixels_tmp (16 byte values)
        vpxor xmm7,xmm7,xmm7                    ;sum_masked_pixels_tmp (8 word values)

        _UpdateBlockSums 0
        _UpdateBlockSums 16
        _UpdateBlockSums 32
        _UpdateBlockSums 48

; Update num_masked_pixels
        vpsrldq xmm0,xmm6,8
        vpaddb xmm6,xmm6,xmm0                   ;num_mask_pixels_tmp (8 byte vals)
        vpsrldq xmm0,xmm6,4
        vpaddb xmm6,xmm6,xmm0                   ;num_mask_pixels_tmp (4 byte vals)
        vpsrldq xmm0,xmm6,2
        vpaddb xmm6,xmm6,xmm0                   ;num_mask_pixels_tmp (2 byte vals)
        vpsrldq xmm0,xmm6,1
        vpaddb xmm6,xmm6,xmm0                   ;num_mask_pixels_tmp (1 byte val)
        vpextrb eax,xmm6,0
        add r10d,eax                            ;num_mask_pixels += num_mask_pixels_tmp

; Update sum_masked_pixels
        vpunpcklwd xmm0,xmm7,xmm9               ;promote sum_mask_pixels_tmp to dwords
        vpunpckhwd xmm1,xmm7,xmm9
        vpaddd xmm5,xmm5,xmm0
        vpaddd xmm5,xmm5,xmm1                   ;sum_mask_pixels += sum_masked_pixels_tmp

        add rdx,64                              ;update pb_src pointer
        add r8,64                               ;update pb_mask pointer
        sub rcx,1                               ;update loop counter
        jnz LP1                                 ;repeat if not done

; Compute mean of masked pixels
        vphaddd xmm0,xmm5,xmm5
        vphaddd xmm1,xmm0,xmm0
        vmovd eax,xmm1                          ;eax = final sum_mask_pixels

        test r10d,r10d                          ;is num_mask_pixels zero?
        jz NoMean                               ;if yes, skip calc of mean
        vcvtsi2sd xmm0,xmm0,eax                 ;xmm0 = sum_masked_pixels
        vcvtsi2sd xmm1,xmm1,r10d                ;xmm1 = num_masked_pixels
        vdivsd xmm2,xmm0,xmm1                   ;xmm2 = mean_masked_pixels
        jmp @F

NoMean: vmovsd xmm2,[R8_MinusOne]               ;use -1.0 for no mean

@@:     mov [rbx+ITD.SumMaskedPixels],eax       ;save sum_masked_pixels
        mov [rbx+ITD.NumMaskedPixels],r10d      ;save num_masked_pixels
        vmovsd [rbx+ITD.MeanMaskedPixels],xmm2  ;save mean
        mov eax,1                               ;set success return code

Done:   _RestoreXmmRegs xmm6,xmm7,xmm8,xmm9
        _DeleteFrame rbx
        ret
AvxCalcImageMean_ endp
        end


The algorithm that's used in example Ch07_08 consists of two phases. Phase 1 constructs the mask image
that's shown in Figure 7-7. Phase 2 computes the mean intensity of all pixels in the grayscale image whose
corresponding mask image pixel is white (i.e., above the specified threshold). The file Ch07_08.h that's shown
in Listing 7-8 defines a structure named ITD that maintains data required by the algorithm. Note this structure
contains two count values: m_NumPixels and m_NumMaskedPixels. The former value is the total number of
image pixels, while the latter value represents the number of image pixels greater than m_Threshold.
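
Because the C++ and assembly language declarations of ITD must agree, it can be useful to state the
expected member offsets explicitly. The following sketch repeats the ITD declaration from Listing 7-8 and
adds compile-time checks; the static_assert lines are an illustration and are not part of the book's
source code. The offsets assume a 64-bit target and an assembly language structure whose members are laid
out back to back in declaration order.

```cpp
#include <cstddef>
#include <cstdint>

struct ITD
{
    uint8_t* m_PbSrc;               // Source image pixel buffer
    uint8_t* m_PbMask;              // Mask image pixel buffer
    uint32_t m_NumPixels;           // Number of source image pixels
    uint32_t m_NumMaskedPixels;     // Number of masked pixels
    uint32_t m_SumMaskedPixels;     // Sum of masked pixels
    uint8_t m_Threshold;            // Image threshold value
    uint8_t m_Pad[3];               // Available for future use
    double m_MeanMaskedPixels;      // Mean of masked pixels
};

// Illustrative checks (assumption: 64-bit pointers). Each offset equals the
// sum of the sizes of the members declared before it in the asm structure.
static_assert(sizeof(void*) == 8, "sketch assumes 64-bit pointers");
static_assert(offsetof(ITD, m_PbSrc) == 0, "PbSrc offset");
static_assert(offsetof(ITD, m_PbMask) == 8, "PbMask offset");
static_assert(offsetof(ITD, m_NumPixels) == 16, "NumPixels offset");
static_assert(offsetof(ITD, m_NumMaskedPixels) == 20, "NumMaskedPixels offset");
static_assert(offsetof(ITD, m_SumMaskedPixels) == 24, "SumMaskedPixels offset");
static_assert(offsetof(ITD, m_Threshold) == 28, "Threshold offset");
static_assert(offsetof(ITD, m_Pad) == 29, "Pad offset");
static_assert(offsetof(ITD, m_MeanMaskedPixels) == 32, "MeanMaskedPixels offset");
static_assert(sizeof(ITD) == 40, "ITD size");
```

The three m_Pad bytes place m_MeanMaskedPixels at offset 32, which is already a multiple of 8, so the
C++ compiler has no reason to insert hidden padding in front of the double.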

The C++ code in Listing 7-8 contains separate thresholding and mean calculating functions. The
function AvxThresholdImageCpp constructs the mask image by comparing each pixel in the grayscale image
to the threshold value that's specified by itd->m_Threshold. If a grayscale image pixel is greater than this
value, its corresponding pixel in the mask image is set to 0xff; otherwise, the mask image pixel is set to
0x00. The function AvxCalcImageMeanCpp uses this mask image to calculate the mean intensity value of all
grayscale image pixels greater than the threshold value. Note that the for loop in this function computes
num_masked_pixels and sum_masked_pixels using simple Boolean expressions instead of logical compare
operations. The former technique is often faster and easier to implement using SIMD arithmetic.

Listing 7-8 also shows assembly language implementations of the thresholding and mean calculating
functions. Following its prolog, the function AvxThresholdImage_ validates the arguments in the supplied
ITD structure by calling the C++ function IsValid. Prior to the call instruction, AvxThresholdImage_ loads
the required argument values for IsValid into the appropriate registers and allocates a home area using
a sub rsp,32 instruction. After argument validation, the movzx r9d,byte ptr [rbx+ITD.Threshold]
instruction loads the threshold value into register R9D. The ensuing vpshufb xmm1,xmm1,xmm0 instruction
"broadcasts" the threshold value to all byte positions in register XMM1. The vpshufb instruction uses
the low-order four bits of each byte in the second source operand as an index that selects a byte from the
first source operand; the selected byte is copied to the corresponding byte position of the destination
operand (a zero is copied instead if the high-order bit of the second source operand byte is set).
Figure 7-8 illustrates this process. The packed threshold value is then scaled using a vpsubb instruction. The
reason for doing this is explained in the next paragraph.


vpshufb xmm3,xmm0,xmm1      ;vpshufb execution example
vpshufb xmm4,xmm0,xmm2      ;generate a packed threshold

Figure 7-8. Execution examples of the instruction vpshufb
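
The byte selection rule can be modeled with a few lines of scalar C++. The names Bytes16, ShuffleBytes,
and BroadcastLowByte are hypothetical helpers introduced only for this sketch; they model the 128-bit
form of vpshufb one byte at a time and are not a substitute for the instruction.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Hypothetical scalar model of the 128-bit vpshufb operation.
// src = first source operand, ctrl = second source operand (control mask).
Bytes16 ShuffleBytes(const Bytes16& src, const Bytes16& ctrl)
{
    Bytes16 des{};

    for (size_t i = 0; i < 16; i++)
    {
        if (ctrl[i] & 0x80)
            des[i] = 0;                     // high-order bit set: copy zero
        else
            des[i] = src[ctrl[i] & 0x0f];   // low-order four bits select a byte
    }

    return des;
}

// An all-zero control mask selects src[0] for every byte position,
// which is how the threshold value gets broadcast.
Bytes16 BroadcastLowByte(const Bytes16& src)
{
    Bytes16 ctrl{};                         // sixteen zero bytes
    return ShuffleBytes(src, ctrl);
}
```

This mirrors the vpxor xmm0,xmm0,xmm0 and vpshufb xmm1,xmm1,xmm0 pair in Listing 7-8: the zeroed
register supplies index 0 for all sixteen byte positions.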


The processing loop in function AvxThresholdImage_ uses the vpcmpgtb (Compare Packed Signed
Integers for Greater Than) instruction to create the mask image. This instruction performs pairwise
compares of the byte elements in the two source operands. If a byte in the first source operand is greater
than the corresponding byte in the second operand, the destination operand byte is set to 0xff; otherwise,
the destination operand byte is set to 0x00. Figure 7-9 illustrates execution of the vpcmpgtb instruction. It is
important to note that vpcmpgtb executes its compares using signed integer arithmetic. This means that the
pixel values in the grayscale image, which are unsigned byte values, must be re-scaled for compatibility
with the vpcmpgtb instruction. The vpsubb instruction remaps the image's grayscale pixel values from
[0, 255] to [-128, 127]. This is also the reason that a vpsubb instruction was used on the packed threshold
value prior to the start of the loop. Following each compare operation, the vmovdqa instruction saves the
mask pixels to the specified buffer. Similar to example Ch07_04, the function AvxThresholdImage_ uses a
partially unrolled processing loop to handle 64 pixels per iteration.


vpcmpgtb xmm2,xmm0,xmm1

Figure 7-9. Execution of the instruction vpcmpgtb
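
The effect of the re-scaling step can be checked with a scalar model of a single byte lane. ThresholdLane
is a hypothetical helper written for this sketch. It assumes the usual two's complement conversion from
uint8_t to int8_t, which mainstream compilers provide and which C++20 guarantees.

```cpp
#include <cstdint>

// Hypothetical scalar model of one byte lane of the vpsubb/vpcmpgtb sequence.
uint8_t ThresholdLane(uint8_t pixel, uint8_t threshold)
{
    // vpsubb with 0x80: wraparound subtraction maps [0, 255] to [-128, 127]
    int8_t pixel_s = static_cast<int8_t>(static_cast<uint8_t>(pixel - 0x80));
    int8_t thresh_s = static_cast<int8_t>(static_cast<uint8_t>(threshold - 0x80));

    // vpcmpgtb: signed greater-than compare yields 0xff or 0x00
    return (pixel_s > thresh_s) ? 0xff : 0x00;
}
```

Subtracting the same constant from both operands preserves their ordering, so the signed compare of the
re-scaled values gives the same answer as an unsigned compare of the original values.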


The assembly language function AvxCalcImageMean_ also begins by validating its arguments using
the C++ function IsValid. Following argument validation, the xor r10d,r10d and vpxor xmm5,xmm5,xmm5
instructions initialize num_masked_pixels and sum_masked_pixels (four doublewords) to zero, respectively.
The processing loop in function AvxCalcImageMean_ uses a macro named _UpdateBlockSums to compute
the intermediate values num_masked_pixels_tmp and sum_masked_pixels_tmp for a block of 64 pixels.
This macro performs its calculations using packed byte and word arithmetic, which reduces the number
of byte to doubleword size-promotions that must be carried out. Figure 7-10 illustrates the arithmetic and
Boolean operations that are performed by _UpdateBlockSums. The values num_masked_pixels (R10D) and
sum_masked_pixels (XMM5) are then updated and the processing loop repeats until all pixels have been
processed.
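
A scalar C++ model helps show why byte and word accumulators are wide enough for one 64-pixel block.
BlockSums and SumBlock64 are hypothetical names introduced for this sketch; the lane arrays stand in for
registers XMM6 and XMM7, and the mask buffer is assumed to contain only 0x00 and 0xff values.

```cpp
#include <cstddef>
#include <cstdint>

struct BlockSums
{
    uint32_t num_masked;    // number of masked pixels in the block
    uint32_t sum_masked;    // sum of masked pixels in the block
};

// Hypothetical scalar model of one processing loop iteration (64 pixels).
BlockSums SumBlock64(const uint8_t* src, const uint8_t* mask)
{
    uint8_t count_lanes[16] = {};   // models xmm6: 16 byte counters
    uint16_t sum_lanes[8] = {};     // models xmm7: 8 word sums

    // Four expansions of _UpdateBlockSums (disp = 0, 16, 32, 48)
    for (size_t disp = 0; disp < 64; disp += 16)
    {
        for (size_t i = 0; i < 16; i++)
        {
            uint8_t m = mask[disp + i];
            uint8_t p = src[disp + i] & m;      // zero out unmasked pixel
            count_lanes[i] += m & 0x01;         // each lane reaches at most 4
            sum_lanes[i % 8] += p;              // low and high halves share a word lane
        }
    }

    // Horizontal reductions: at most 64 for the count, 8 * 255 per word lane
    BlockSums bs{0, 0};
    for (uint8_t c : count_lanes)
        bs.num_masked += c;
    for (uint16_t s : sum_lanes)
        bs.sum_masked += s;

    return bs;
}
```

Each byte counter can reach at most 4 and their total at most 64, so the packed byte additions cannot
overflow. Each word lane accumulates 8 pixels, for a maximum of 8 * 255 = 2040, which fits comfortably in
16 bits.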


Figure 7-10. Masked pixel sum and pixel count calculations performed by macro _UpdateBlockSums


Following completion of its processing loop, function AvxCalcImageMean_ calculates the final mean
intensity value using scalar double-precision floating-point arithmetic. Note that num_masked_pixels is tested
prior to calculating the mean in order to avoid a division-by-zero error. Here is the output for source code
example Ch07_08:


Results for AvxThreshold

                           C++       X86-AVX
--------------------------------------------
SumPixelsMasked:      23813043      23813043
NumPixelsMasked:        138220        138220
MeanMaskedPixels:     172.2836      172.2836

Running benchmark function AvxThreshold_BM - please wait
Benchmark times save to file Ch07_08_AvxThreshold_BM_CHROMIUM.csv


Table 7-4 shows timing measurements for the source code example Ch07_08. Note that the
measurements in this table are for an entire image thresholding and mean calculation sequence.


Table 7-4. Mean Execution Times (Microseconds) to Perform Image Thresholding and Mean Calculation
Using TestImage2.bmp

CPU         C++     Assembly Language
i7-4790S    289     50
i9-7900X    250     40
i7-8700K    242     39


Summary 


Here are the key learning points for Chapter 7: 


•   The vpadd[b|w|d|q] instructions perform packed addition. The vpadds[b|w] and
    vpaddus[b|w] instructions perform packed signed and unsigned saturated addition.

•   The vpsub[b|w|d|q] instructions perform packed subtraction. The vpsubs[b|w]
    and vpsubus[b|w] instructions perform packed signed and unsigned saturated
    subtraction.

•   The vpmul[h|l]w instructions carry out multiplication using packed word operands.
    The vpmuldq and vpmulld instructions carry out multiplication using packed
    doubleword operands.

•   The vpsll[w|d|q] and vpsrl[w|d|q] instructions execute logical left and right shifts
    using packed operands. The vpsra[w|d] instructions execute arithmetic right
    shifts using packed operands. The vps[l|r]ldq instructions execute logical left and
    right shifts using 128-bit wide operands.

•   Assembly language functions can use the vpand, vpor, and vpxor instructions to
    perform bitwise AND, inclusive OR, and exclusive OR operations using packed
    integer operands.

•   The vpextr[b|w|d|q] instructions extract an element value from a packed operand.
    The vpinsr[b|w|d|q] instructions insert an element value into a packed operand.

•   The vpunpckl[bw|wd|dq] and vpunpckh[bw|wd|dq] instructions unpack and
    interleave the contents of their two source operands. These instructions are
    frequently used to size-promote packed integer operands. The vpackus[wb|dw]
    instructions size-reduce packed integer operands using unsigned saturated
    arithmetic.

•   The vpminu[b|w|d] and vpmaxu[b|w|d] instructions perform packed unsigned
    integer minimum-maximum compares.

•   The vpshufb instruction rearranges the bytes of a packed operand according to a
    control mask.

•   The vpcmpgt[b|w|d|q] instructions perform signed integer greater than compares
    using packed operands.

•   Aligning the target of a jump instruction to a 16-byte boundary often results in
    faster-executing for loops.


Advanced Vector Extensions 2


In the previous four chapters, you learned about the architecture and processing capabilities of AVX. 
These chapters explicated AVX's register sets, data types, and instructions. They also included numerous 
source code examples that illustrated how to perform scalar floating-point arithmetic, packed
floating-point computations, and packed integer calculations. Many of the packed floating-point and packed
integer source code examples exemplified important SIMD programming strategies and techniques whose 
exploitation often results in faster executing code. 

This chapter explains the architecture and computational resources of Advanced Vector Extensions 
2 (AVX2). You'll learn about AVX2's augmented capabilities for processing packed floating-point and 
packed integer operands. You'll also review important details regarding recent x86 platform instruction set 
extensions, including half-precision floating-point conversions, fused-multiply-add (FMA) operations, and 
new general-purpose register instructions. 

The material presented in this chapter assumes that you have a solid understanding of AVX. If you feel 
that your understanding of AVX's register sets, data types, or SIMD processing capabilities is lacking in any 
way, you may want to review the relevant sections in the previous chapters before proceeding. 


AVX2 Execution Environment 


AVX2 uses the same YMM and XMM register sets as AVX (see Figure 4-6). AVX2 also uses the MXCSR
control-status register to signal floating-point arithmetic errors, configure rounding options, and control
the generation of floating-point exceptions (see Figure 4-11). Like AVX, AVX2 supports floating-point SIMD
operations using 128-bit or 256-bit wide operands containing either single-precision or double-precision
values. AVX2 extends the packed integer processing capabilities of AVX to include both 128-bit and 256-bit
wide operands (AVX only supports 128-bit wide integer operands). When used with a 256-bit wide packed
integer operand, an AVX2 instruction can simultaneously process 32 byte, 16 word, 8 doubleword, or 4
quadword values. AVX2 also adds a number of useful instructions that administer packed floating-point and
packed integer operands. You'll learn more about these instructions later in this chapter.

AVX2 instructions use the same instruction syntax as AVX. Most AVX2 instructions employ a
three-operand format that consists of two source operands and one destination operand. Nearly all AVX2
instruction source operands are non-destructive. This means that source operands are not modified during 
instruction execution, except in cases where the destination operand register is the same as one of the 
source operand registers. A small set of AVX2 instructions employ a third immediate source operand that's 
typically used as a control mask. 

The alignment requirements for AVX2 operands in memory are the same as AVX. Except for data
transfer instructions that explicitly reference an aligned operand in memory (e.g., vmovdqa, vmovap[d|s],
etc.), proper alignment of an AVX2 operand in memory is not mandatory. However, 128-bit wide operands in
memory should always be aligned to a 16-byte boundary whenever possible in order to maximize processing
performance. Similarly, 256-bit wide operands should be aligned to a 32-byte boundary.


AVX2 Packed Floating-Point


AVX2 expands the packed floating-point processing capabilities of AVX with the addition of data gather
operations. The vgatherdp[d|s] and vgatherqp[d|s] instructions load multiple elements from
non-contiguous memory locations (usually an array) into an XMM or YMM register. These instructions use a
special memory addressing mode called vector scale-index-base (VSIB). VSIB memory addressing employs
the following components to specify element locations in memory:

•   Scale: The element size scale factor (1, 2, 4, or 8).

•   Index: A vector index register (XMM or YMM) that contains signed doubleword or
    signed quadword indices.

•   Base: A general-purpose register that points to the start of an array in memory.

•   Displacement: An optional fixed offset from the start of the array.

Prior to the execution of a vgatherdp[d|s] or vgatherqp[d|s] instruction, the vector index register
operand must be loaded with the correct indices. The processor uses these indices to select elements from
the array. Figure 8-1 illustrates execution of the instruction vgatherdps xmm0,[rax+xmm1*4],xmm2. In this
example, register RAX points to the start of an array containing single-precision floating-point values; register
XMM1 holds four signed doubleword array indices; and register XMM2 contains a copy control mask. The
copy control mask determines whether or not the vgatherdps instruction copies a particular array element
to the destination operand. If the most significant bit of a control mask element is set, the corresponding
array element that's specified in the vector index register is copied to the destination operand; otherwise,
the destination operand element is not modified. Following successful execution of a vgatherdp[d|s] or
vgatherqp[d|s] instruction, the copy control mask register (which is a source operand) contains all zeros.


vgatherdps xmm0,[rax+xmm1*4],xmm2

xmm2 (copy control mask) before execution: 0x80000000 0x80000000 0x00000000 0x80000000
xmm2 (copy control mask) after execution:  0x00000000 0x00000000 0x00000000 0x00000000

Figure 8-1. Illustration of the vgatherdps instruction execution


The destination operand and second source operand (the copy control mask) of a vgatherdp[d|s]
or vgatherqp[d|s] instruction must be XMM or YMM registers. The first source operand specifies the
VSIB components (i.e., base register, vector index register, scale factor, and optional displacement). The
vgatherdp[d|s] and vgatherqp[d|s] instructions do not perform any checks for invalid indices. An invalid
index is any vector index register value that directs a gather instruction to load an element from a memory
location that's outside the limits of the array. Using an invalid index will yield an incorrect result and possibly
cause the processor to generate an exception.
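
The gather rules described above can be summarized with a scalar C++ model of the four-element
vgatherdps form. GatherDps is a hypothetical helper written for this sketch, and the index and mask values
used to exercise it would be chosen by the caller; they are not taken from Figure 8-1.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of vgatherdps xmm0,[rax+xmm1*4],xmm2.
// des = destination (xmm0), base = array start (rax),
// indices = vector index register (xmm1), mask = copy control mask (xmm2).
void GatherDps(std::array<float, 4>& des, const float* base,
               const std::array<int32_t, 4>& indices,
               std::array<uint32_t, 4>& mask)
{
    for (size_t i = 0; i < 4; i++)
    {
        // Copy the element only if the mask element's most significant bit is set;
        // otherwise the destination element is left unchanged.
        if (mask[i] & 0x80000000u)
            des[i] = base[indices[i]];      // no bounds check, like the instruction

        mask[i] = 0;                        // mask holds all zeros on completion
    }
}
```

As with the instruction itself, the model trusts the caller to supply valid indices. An out-of-range index
reads memory outside the array.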

The other notable AVX2 packed floating-point enhancement involves the vbroadcasts[d|s]
instructions. On processors that support AVX2, the source operand for these instructions can be an XMM
register (AVX only supports vbroadcasts[d|s] source operands in memory). When used in this manner,
the vbroadcasts[d|s] instructions copy the low-order double-precision or single-precision floating-point
element of an XMM register to each element position in the destination operand.


AVX2 Packed Integer 


As mentioned earlier in this chapter, AVX2 extends the packed integer capabilities of AVX to support both
128-bit and 256-bit wide operands. On systems that support AVX2, most packed integer instructions can use
either the XMM or YMM registers as operands. The most notable exceptions to this rule are the
vpextr[b|w|d|q] (Extract Integer Value) and vpinsr[b|w|d|q] (Insert Integer Value) instructions, which
cannot be used with a YMM register operand. AVX2 also adds a number of new packed integer instructions that
have no corresponding AVX (or x86-SSE) counterpart. Table 8-1 lists these instructions in alphabetical order.


Table 8-1. Summary of New AVX2 Packed Integer Instructions

Mnemonic                Description
vbroadcasti128          Broadcast 128 bits of integer data
vextracti128            Extract 128 bits of integer data
vinserti128             Insert 128 bits of integer data
vpblendd                Blend packed doublewords
vpbroadcast[b|w|d|q]    Broadcast integer value
vperm2i128              Permute 128-bit integer data
vperm[d|q]              Permute packed integers
vpgatherd[d|q]          Packed integer gather using signed doubleword indices
vpgatherq[d|q]          Packed integer gather using signed quadword indices
vpmaskmov[d|q]          Conditional packed integer data move
vpsllv[d|q]             Left logical shift using individual element bit counts
vpsravd                 Right arithmetic shift using individual element bit counts
vpsrlv[d|q]             Right logical shift using individual element bit counts

The vpgatherd[d|q] and vpgatherq[d|q] instructions that are shown in Table 8-1 use the same VSIB 
memory addressing scheme as their floating-point counterparts. 
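
The variable bit-shift instructions in Table 8-1 differ from the older packed shifts in that every element has its own shift count. The following sketch models vpsllvd for a 256-bit operand in plain C++. The function name ShiftLeftVarU32 is hypothetical, and the code is an illustration of the instruction's element-wise behavior rather than a replacement for it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of vpsllvd with 256-bit operands: each of the eight
// doublewords in 'a' is shifted left by the count stored in the same
// element position of 'counts'. A count greater than 31 yields zero.
std::array<uint32_t, 8> ShiftLeftVarU32(const std::array<uint32_t, 8>& a,
                                        const std::array<uint32_t, 8>& counts)
{
    std::array<uint32_t, 8> result {};

    for (size_t i = 0; i < a.size(); i++)
        result[i] = (counts[i] > 31) ? 0 : (a[i] << counts[i]);

    return result;
}
```

The explicit test for counts above 31 is needed in C++ because shifting a 32-bit value by 32 or more bits is undefined in the language, whereas the instruction defines the result as zero.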

X86 Instruction Set Extensions 


In recent years, a number of instruction set extensions besides AVX and AVX2 have been added to the x86 
platform. Many of these extensions include instructions that carry out specialized operations or accelerate 
the performance of specific algorithms. Table 8-2 lists the x86 instruction set extensions whose use is 
discussed and illustrated in subsequent chapters. It is important to keep in mind that all of the extensions 
shown in this table are distinct processor instruction sets. What this means from a programming perspective 
is that you should not assume that a particular instruction set or specific instruction is available based on 
whether or not the executing processor supports AVX or AVX2. The availability of a specific instruction set 
extension, including AVX and AVX2, should always be explicitly tested for using the cpuid instruction. This is 
especially important for software compatibility with future processors from both AMD and Intel. You'll learn 
how to do this in Chapter 16. 
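
Ahead of the full treatment in Chapter 16, the following sketch shows the general idea of decoding cpuid feature bits in C++. It is not the book's code: the structure and function names are hypothetical, and the step that actually executes cpuid is compiler-specific and omitted. The bit positions are those documented in the AMD and Intel programming reference manuals. Note that a complete test for AVX, AVX2, or FMA must also confirm that the operating system saves the YMM register state, which this sketch does not do.

```cpp
#include <cassert>
#include <cstdint>

// Register values returned by three cpuid queries. Obtaining them is
// compiler-specific and is deliberately left out of this sketch.
struct CpuidRegs
{
    uint32_t Leaf1Ecx;          // cpuid leaf 01H, register ECX
    uint32_t Leaf7Ebx;          // cpuid leaf 07H (sub-leaf 0), register EBX
    uint32_t Leaf80000001Ecx;   // cpuid leaf 80000001H, register ECX
};

struct FeatureFlags
{
    bool Fma, Popcnt, F16c, Bmi1, Avx2, Bmi2, Adx, Lzcnt;
};

inline bool TestBit(uint32_t value, unsigned int bit)
{
    assert(bit < 32);
    return ((value >> bit) & 1u) != 0;
}

// Each extension is reported by its own feature flag and must be
// tested individually.
FeatureFlags DecodeFeatureFlags(const CpuidRegs& regs)
{
    FeatureFlags ff {};
    ff.Fma    = TestBit(regs.Leaf1Ecx, 12);
    ff.Popcnt = TestBit(regs.Leaf1Ecx, 23);
    ff.F16c   = TestBit(regs.Leaf1Ecx, 29);
    ff.Bmi1   = TestBit(regs.Leaf7Ebx, 3);
    ff.Avx2   = TestBit(regs.Leaf7Ebx, 5);
    ff.Bmi2   = TestBit(regs.Leaf7Ebx, 8);
    ff.Adx    = TestBit(regs.Leaf7Ebx, 19);
    ff.Lzcnt  = TestBit(regs.Leaf80000001Ecx, 5);
    return ff;
}
```

Keeping the decode step separate from the cpuid query makes it straightforward to verify with synthetic register values.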


Table 8-2. Recent x86 Instruction Set Extensions 


Instruction Set Extension                       CPUID Feature Flag
Enhanced unsigned integer addition              ADX
Advanced bit manipulation (group 1)             BMI1
Advanced bit manipulation (group 2)             BMI2
Half-precision floating-point conversions       F16C
Fused-multiply-add                              FMA
Count leading zero bits                         LZCNT
Count set bits                                  POPCNT


The remainder of this section briefly describes the instruction set extensions that are shown in Table 8-2. 
Chapters 10 and 11 contain source code examples that illustrate how to use some of the instructions that are 
included in these extensions. Information regarding the instruction set extensions not shown in Table 8-2 can 
be found in the programming reference manuals published by AMD and Intel. Appendix A contains a list of 
these manuals. 


Half-Precision Floating-Point 


Recent processors from both AMD and Intel incorporate instructions that carry out half-precision 
floating-point conversions. Compared to a standard single-precision floating-point value, a half-precision 
floating-point value is a reduced-precision floating-point number that contains three fields: an exponent (5 bits), 
a significand (11 bits), and a sign bit. Each half-precision floating-point value is 16 bits wide because the leading 
digit of the significand is implied and only the remaining 10 significand bits are stored. Compatible processors 
include instructions that can convert packed half-precision floating-point values to packed single-precision 
floating-point and vice versa. Table 8-3 shows these instructions. Half-precision floating-point values are 
primarily intended to reduce data storage space requirements, either in memory or on a physical device. The 
drawbacks of using half-precision floating-point values include reduced precision and limited range. Processors 
that support the conversion instructions shown in Table 8-3 do not include instructions for performing common 
arithmetic operations such as addition, subtraction, multiplication, and division using half-precision 
floating-point values. 
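
To make the field layout concrete, the following sketch decodes a 16-bit half-precision value in software. It is an illustration only; the function name HalfToFloat is hypothetical, and the code assumes the standard IEEE 754 half-precision encoding with an exponent bias of 15. On a compatible processor, the conversion instructions perform this work in hardware.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Software decode of a half-precision value: 1 sign bit, 5 exponent
// bits (bias 15), and 10 stored significand bits.
float HalfToFloat(uint16_t h)
{
    const int sign = (h >> 15) & 0x1;
    const int exponent = (h >> 10) & 0x1f;
    const int fraction = h & 0x3ff;
    float value;

    if (exponent == 0)
    {
        // Zero or denormal: there is no implied leading 1
        value = std::ldexp(static_cast<float>(fraction), -24);
    }
    else if (exponent == 31)
    {
        // Infinity or NaN
        value = (fraction == 0) ? std::numeric_limits<float>::infinity()
                                : std::numeric_limits<float>::quiet_NaN();
    }
    else
    {
        // Normal: the implied leading 1 yields an 11-bit significand
        value = std::ldexp(static_cast<float>(fraction | 0x400), exponent - 25);
    }

    return sign ? -value : value;
}
```

For example, the bit pattern 0x3C00 decodes to 1.0, and 0x7BFF decodes to 65504.0, the largest finite half-precision value, which illustrates the limited range mentioned above.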

Table 8-3. Half-Precision Floating-Point Conversion Instructions 


Mnemonic Description 
vcvtph2ps Convert half-precision floating-point to single-precision floating-point 
vcvtps2ph Convert single-precision floating-point to half-precision floating-point 


Fused-Multiply-Add (FMA) 


Modern processors from both AMD and Intel also include instructions that perform FMA operations. An FMA 
instruction combines multiplication and addition (or subtraction) into a single operation. More specifically, 
a fused-multiply-add (or fused-multiply-subtract) calculation performs a floating-point multiplication 
followed by a floating-point addition (or subtraction) using a single rounding operation. For example, 
consider the expression d = b * c + a. Using standard floating-point arithmetic, the processor initially 
calculates the product b * c, which includes a rounding operation. This is followed by a floating-point 
addition computation that also includes a rounding operation. If the expression is evaluated using FMA 
arithmetic, the processor does not round the intermediate product b * c. Rounding is carried out only once 
using the calculated product-sum b * c + a. FMA instructions are often used to improve the performance 
and accuracy of multiply-accumulate computations such as dot products and matrix-vector multiplications. 
Many signal-processing algorithms also make extensive use of FMA operations. 
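
The difference between two rounding operations and one can be observed from C++ with the standard library function std::fma, as in the following sketch. The function names MulAddSeparate and MulAddFused are hypothetical. The example assumes that double arithmetic is evaluated in IEEE 754 double precision; the volatile temporary is there only to stop the compiler from fusing the two steps on its own.

```cpp
#include <cassert>
#include <cmath>

// d = b * c + a evaluated with two roundings (product, then sum).
// The volatile temporary keeps the compiler from fusing the two steps.
double MulAddSeparate(double a, double b, double c)
{
    volatile double product = b * c;    // product is rounded here
    return product + a;                 // sum is rounded here
}

// d = b * c + a evaluated with a single rounding of the exact product-sum.
double MulAddFused(double a, double b, double c)
{
    return std::fma(b, c, a);
}
```

With b = c = 1 + 2^-30 and a = -(1 + 2^-29), the exact product is 1 + 2^-29 + 2^-60. Rounding the product to double precision discards the 2^-60 term, so MulAddSeparate returns 0.0, while MulAddFused returns 2^-60.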

FMA instruction mnemonics employ a three-digit operand-ordering scheme that specifies which 
operands to use for multiplication and addition (or subtraction). In this scheme, all three instruction 
operands are used as source operands. The first mnemonic digit specifies the source operand to use as 
the multiplicand; the second digit specifies the source operand to use as the multiplier; and the third digit 
specifies the source operand that is added to (or subtracted from) the product. For example, consider the 
instruction vfmadd132sd xmm4,xmm5,xmm6 (Fused Multiply-Add of Scalar Double-Precision Floating-Point 
Values). In this example, registers XMM4, XMM5, and XMM6 are source operands 1, 2, and 3, respectively. 
The vfmadd132sd instruction computes xmm4[63:0] * xmm6[63:0] + xmm5[63:0], rounds the product-sum 
according to the rounding mode specified by MXCSR.RC, and saves the final result to xmm4[63:0]. 
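
The three operand orderings can be summarized with scalar C++ models, shown in the following sketch. The function names are hypothetical; each function returns the value that the corresponding instruction would write to its first operand, and std::fma supplies the single rounding operation.

```cpp
#include <cassert>
#include <cmath>

// Scalar models of the three-digit operand-ordering scheme.
// Digits 1 and 2 select the multiplicand and multiplier; digit 3
// selects the operand that is added to the product.
double Fmadd132(double src1, double src2, double src3)
{
    return std::fma(src1, src3, src2);      // src1 * src3 + src2
}

double Fmadd213(double src1, double src2, double src3)
{
    return std::fma(src2, src1, src3);      // src2 * src1 + src3
}

double Fmadd231(double src1, double src2, double src3)
{
    return std::fma(src2, src3, src1);      // src2 * src3 + src1
}
```

For src1 = 2.0, src2 = 3.0, and src3 = 5.0, the three functions return 13.0, 11.0, and 17.0, respectively. The choice of ordering mostly depends on which register can be overwritten and which operand resides in memory.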

The x86 FMA instruction set extension supports operations using scalar or packed floating-point 
values, both single-precision and double-precision. Packed FMA operations can be performed using either 
the XMM or YMM registers. The XMM (YMM) registers support packed FMA calculations using two (four) 
double-precision or four (eight) single-precision floating-point values. Scalar FMA calculations are carried 
out using the XMM register set. For all FMA instructions, the first and second source operands must be 
registers. The third source operand can be a register or a memory location. If an FMA instruction uses an 
XMM register as a destination operand, the high-order 128 bits of the corresponding YMM register are set to 
zero. FMA instructions carry out their sole rounding operation using the mode that's specified by MXCSR.RC, 
as explained in the previous paragraph. 

Table 8-4 shows the FMA instruction set. The instruction mnemonics in this table use the following two- 
letter suffixes: pd (packed double-precision floating-point), ps (packed single-precision floating-point), sd 
(scalar double-precision floating-point), and ss (scalar single-precision floating-point). The symbols src1, 
src2, and src3 denote the three source operands; the destination operand des is always the same as src1. 

Table 8-4. Overview of FMA Instructions 


Subgroup     Mnemonic                    Operation
VFMADD       vfmadd132[pd|ps|sd|ss]      des = src1 * src3 + src2
             vfmadd213[pd|ps|sd|ss]      des = src2 * src1 + src3
             vfmadd231[pd|ps|sd|ss]      des = src2 * src3 + src1
VFMSUB       vfmsub132[pd|ps|sd|ss]      des = src1 * src3 - src2
             vfmsub213[pd|ps|sd|ss]      des = src2 * src1 - src3
             vfmsub231[pd|ps|sd|ss]      des = src2 * src3 - src1
VFMADDSUB    vfmaddsub132[pd|ps]         des = src1 * src3 + src2 (odd elements)
                                         des = src1 * src3 - src2 (even elements)
             vfmaddsub213[pd|ps]         des = src2 * src1 + src3 (odd elements)
                                         des = src2 * src1 - src3 (even elements)
             vfmaddsub231[pd|ps]         des = src2 * src3 + src1 (odd elements)
                                         des = src2 * src3 - src1 (even elements)
VFMSUBADD    vfmsubadd132[pd|ps]         des = src1 * src3 - src2 (odd elements)
                                         des = src1 * src3 + src2 (even elements)
             vfmsubadd213[pd|ps]         des = src2 * src1 - src3 (odd elements)
                                         des = src2 * src1 + src3 (even elements)
             vfmsubadd231[pd|ps]         des = src2 * src3 - src1 (odd elements)
                                         des = src2 * src3 + src1 (even elements)
VFNMADD      vfnmadd132[pd|ps|sd|ss]     des = -(src1 * src3) + src2
             vfnmadd213[pd|ps|sd|ss]     des = -(src2 * src1) + src3
             vfnmadd231[pd|ps|sd|ss]     des = -(src2 * src3) + src1
VFNMSUB      vfnmsub132[pd|ps|sd|ss]     des = -(src1 * src3) - src2
             vfnmsub213[pd|ps|sd|ss]     des = -(src2 * src1) - src3
             vfnmsub231[pd|ps|sd|ss]     des = -(src2 * src3) - src1


The FMA instructions that are shown in Table 8-4 are often identified as FMA3 instructions by many 
CPU feature detection utilities and online documentation sources. Some AMD processors also include 
supplemental FMA4 instructions, which carry out their FMA operations using three source operands and 
one destination operand (the three-digit operand ordering scheme is not used). These instructions are not 
shown in Table 8-4. 


General-Purpose Register Instruction Set Extensions 


Recent enhancements to the x86 platform have also included a number of general-purpose register 
instruction set extensions. The ADX, BMI1, BMI2, LZCNT, and POPCNT instruction set extensions 
support enhanced unsigned integer arithmetic, advanced bit manipulations, and flagless register rotate 
and shift operations (a flagless operation does not update any of the status flags in RFLAGS). Many of 
these instructions are designed to accelerate the performance of specific algorithms such as large-integer 
arithmetic and data encryption. Some of these general-purpose register instructions use a three-operand 
assembly-language syntax that's similar to AVX. Table 8-5 lists the instructions that comprise the ADX, BMI1, 
BMI2, LZCNT, and POPCNT extensions. 


Table 8-5. Overview of ADX, BMI1, BMI2, LZCNT, and POPCNT Instructions 


Mnemonic    CPUID Feature Flag    Description
adcx        ADX                   Unsigned integer addition with carry flag
adox        ADX                   Unsigned integer addition with overflow flag
andn        BMI1                  Bitwise AND of inverted operand1 with operand2
bextr       BMI1                  Bitfield extract
blsi        BMI1                  Extract lowest set bit
blsmsk      BMI1                  Get mask up to lowest set bit
blsr        BMI1                  Reset lowest set bit
bzhi        BMI2                  Zero high bits
lzcnt       LZCNT                 Count number of leading zero bits
mulx        BMI2                  Flagless unsigned integer multiplication
pdep        BMI2                  Parallel bits deposit
pext        BMI2                  Parallel bits extract
popcnt      POPCNT                Count number of set bits
rorx        BMI2                  Flagless rotate right
sarx        BMI2                  Flagless arithmetic shift right
shlx        BMI2                  Flagless logical shift left
shrx        BMI2                  Flagless logical shift right
tzcnt       BMI1                  Count number of trailing zero bits
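

Several of the instructions in Table 8-5 replace short, well-known C++ bit-manipulation idioms with a single instruction. The following sketch gives portable 64-bit models of a few of them. The function names are hypothetical, and the code is meant only to clarify what each instruction computes, not to suggest how a compiler would implement it.

```cpp
#include <cassert>
#include <cstdint>

uint64_t Andn(uint64_t a, uint64_t b) { return ~a & b; }      // andn
uint64_t Blsi(uint64_t x)   { return x & (0 - x); }           // extract lowest set bit
uint64_t Blsmsk(uint64_t x) { return x ^ (x - 1); }           // mask up to lowest set bit
uint64_t Blsr(uint64_t x)   { return x & (x - 1); }           // reset lowest set bit

// popcnt: count the number of set bits
unsigned int Popcnt(uint64_t x)
{
    unsigned int n = 0;
    for (; x != 0; x &= x - 1)
        n++;
    return n;
}

// tzcnt: count trailing zero bits (returns 64 when x is zero)
unsigned int Tzcnt(uint64_t x)
{
    unsigned int n = 0;
    for (uint64_t bit = 1; bit != 0 && (x & bit) == 0; bit <<= 1)
        n++;
    return n;
}

// lzcnt: count leading zero bits (returns 64 when x is zero)
unsigned int Lzcnt(uint64_t x)
{
    unsigned int n = 0;
    for (uint64_t bit = 1ull << 63; bit != 0 && (x & bit) == 0; bit >>= 1)
        n++;
    return n;
}
```

For the value 0xB0 (binary 1011 0000), Blsi returns 0x10, Blsmsk returns 0x1F, Blsr returns 0xA0, Popcnt returns 3, and Tzcnt returns 4.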


Summary 


Here are the key learning points of Chapter 8: 

• AVX2 uses the same register sets, data types, and instruction syntax as AVX. 

• AVX2 extends the packed integer processing capabilities of AVX to support 
operations using 256-bit wide operands. 

• AVX2 includes new packed integer processing instructions that perform broadcast, 
permute, and variable bit-shift operations. 

• The vgather[d|q]p[d|s] and vpgather[d|q][d|q] instructions load floating-point or 
integer values into an XMM or YMM register from non-contiguous locations in memory. 
These instructions use the VSIB addressing mode to carry out their operations. 

• The vcvtph2ps and vcvtps2ph instructions perform conversions between packed 
half-precision and packed single-precision floating-point values. 

• All FMA instructions execute a floating-point multiplication followed by a 
floating-point addition (or subtraction) using a single rounding operation. The x86 FMA 
instruction set extension supports a variety of FMA operations using both scalar and 
packed single-precision or double-precision floating-point values. 

• The ADX, BMI1, BMI2, LZCNT, and POPCNT instruction set extensions include 
instructions that support enhanced unsigned integer addition, advanced bit 
manipulation, and flagless shift and rotate operations. 


CHAPTER 9 


AVX2 Programming - Packed 
Floating-Point 


In Chapter 6, you learned how to use the AVX instruction set to perform packed floating-point operations 
using the XMM register set and 128-bit wide operands. In this chapter, you learn how to carry out packed 
floating-point operations using the YMM register set and 256-bit wide operands. The chapter begins with a 
simple example that demonstrates the basics of packed floating-point arithmetic and YMM register use. This 
is followed by three source code examples that illustrate how to perform packed calculations with floating- 
point arrays. 

Chapter 6 also presented source code examples that exploited the AVX instruction set to accelerate 
matrix transposition and multiplication using single-precision floating-point values. In this chapter, you 
learn how to perform these same calculations using double-precision floating-point values. You also study 
a source code example that computes the inverse of a matrix. The final two source code examples in this 
chapter explain how to perform data blends, permutes, and gathers using packed floating-point operands. 

You may recall that the source code examples in Chapter 6 used only XMM register operands with AVX 
instructions. This was done to avoid information overload and maintain a reasonable chapter length. Nearly 
all AVX floating-point instructions can use either the XMM or YMM registers as operands. Many of the 
source code examples in this chapter will run on a processor that supports AVX. The function names in these 
examples use the prefix Avx. Similarly, source code examples that require an AVX2-compatible processor 
use the function name prefix Avx2. You can use one of the freely-available tools listed in Appendix A to 
determine whether your computer supports only AVX or both AVX and AVX2. 


Packed Floating-Point Arithmetic 


Listing 9-1 shows the source code for example Ch09_01. This example illustrates how to perform common 
arithmetic operations using 256-bit wide single-precision and double-precision floating-point operands. 
It also illustrates how to use the vzeroupper instruction and several MASM directives for 256-bit wide 
operands. 


Listing 9-1. Example Ch09_01 


#pragma once
#include <string>
#include <cstdint>
#include <sstream>
#include <iomanip>

struct YmmVal
{
public:
    union
    {
        int8_t m_I8[32];
        int16_t m_I16[16];
        int32_t m_I32[8];
        int64_t m_I64[4];
        uint8_t m_U8[32];
        uint16_t m_U16[16];
        uint32_t m_U32[8];
        uint64_t m_U64[4];
        float m_F32[8];
        double m_F64[4];
    };

    // (formatting and display member functions are not shown)
};

//------------------------------------------------
//               Ch09_01.cpp
//------------------------------------------------


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#define _USE_MATH_DEFINES
#include <math.h>
#include "YmmVal.h"

using namespace std;

extern "C" void AvxPackedMathF32_(const YmmVal& a, const YmmVal& b, YmmVal c[8]);
extern "C" void AvxPackedMathF64_(const YmmVal& a, const YmmVal& b, YmmVal c[8]);

void AvxPackedMathF32(void)
{
    alignas(32) YmmVal a;
    alignas(32) YmmVal b;
    alignas(32) YmmVal c[8];

    a.m_F32[0] = 36.0f;        b.m_F32[0] = -0.1111111f;
    a.m_F32[1] = 0.03125f;     b.m_F32[1] = 64.0f;
    a.m_F32[2] = 2.0f;         b.m_F32[2] = -0.0625f;
    a.m_F32[3] = 42.0f;        b.m_F32[3] = 8.666667f;
    a.m_F32[4] = 7.0f;         b.m_F32[4] = -18.125f;
    a.m_F32[5] = 20.5f;        b.m_F32[5] = 56.0f;
    a.m_F32[6] = 36.125f;      b.m_F32[6] = 24.0f;
    a.m_F32[7] = 0.5f;         b.m_F32[7] = -98.6f;

    AvxPackedMathF32_(a, b, c);

    cout << ("\nResults for AvxPackedMathF32\n");

    cout << "a[0]:       " << a.ToStringF32(0) << '\n';
    cout << "b[0]:       " << b.ToStringF32(0) << '\n';
    cout << "addps[0]:   " << c[0].ToStringF32(0) << '\n';
    cout << "subps[0]:   " << c[1].ToStringF32(0) << '\n';
    cout << "mulps[0]:   " << c[2].ToStringF32(0) << '\n';
    cout << "divps[0]:   " << c[3].ToStringF32(0) << '\n';
    cout << "absps b[0]: " << c[4].ToStringF32(0) << '\n';
    cout << "sqrtps a[0]:" << c[5].ToStringF32(0) << '\n';
    cout << "minps[0]:   " << c[6].ToStringF32(0) << '\n';
    cout << "maxps[0]:   " << c[7].ToStringF32(0) << '\n';

    cout << '\n';

    cout << "a[1]:       " << a.ToStringF32(1) << '\n';
    cout << "b[1]:       " << b.ToStringF32(1) << '\n';
    cout << "addps[1]:   " << c[0].ToStringF32(1) << '\n';
    cout << "subps[1]:   " << c[1].ToStringF32(1) << '\n';
    cout << "mulps[1]:   " << c[2].ToStringF32(1) << '\n';
    cout << "divps[1]:   " << c[3].ToStringF32(1) << '\n';
    cout << "absps b[1]: " << c[4].ToStringF32(1) << '\n';
    cout << "sqrtps a[1]:" << c[5].ToStringF32(1) << '\n';
    cout << "minps[1]:   " << c[6].ToStringF32(1) << '\n';
    cout << "maxps[1]:   " << c[7].ToStringF32(1) << '\n';
}

void AvxPackedMathF64(void)
{
    alignas(32) YmmVal a;
    alignas(32) YmmVal b;
    alignas(32) YmmVal c[8];

    a.m_F64[0] = 2.0;          b.m_F64[0] = M_PI;
    a.m_F64[1] = 4.0;          b.m_F64[1] = M_E;
    a.m_F64[2] = 7.5;          b.m_F64[2] = -9.125;
    a.m_F64[3] = 3.0;          b.m_F64[3] = -M_PI;

    AvxPackedMathF64_(a, b, c);

    cout << ("\nResults for AvxPackedMathF64\n");

    cout << "a[0]:       " << a.ToStringF64(0) << '\n';
    cout << "b[0]:       " << b.ToStringF64(0) << '\n';
    cout << "addpd[0]:   " << c[0].ToStringF64(0) << '\n';
    cout << "subpd[0]:   " << c[1].ToStringF64(0) << '\n';
    cout << "mulpd[0]:   " << c[2].ToStringF64(0) << '\n';
    cout << "divpd[0]:   " << c[3].ToStringF64(0) << '\n';
    cout << "abspd b[0]: " << c[4].ToStringF64(0) << '\n';
    cout << "sqrtpd a[0]:" << c[5].ToStringF64(0) << '\n';
    cout << "minpd[0]:   " << c[6].ToStringF64(0) << '\n';
    cout << "maxpd[0]:   " << c[7].ToStringF64(0) << '\n';

    cout << '\n';

    cout << "a[1]:       " << a.ToStringF64(1) << '\n';
    cout << "b[1]:       " << b.ToStringF64(1) << '\n';
    cout << "addpd[1]:   " << c[0].ToStringF64(1) << '\n';
    cout << "subpd[1]:   " << c[1].ToStringF64(1) << '\n';
    cout << "mulpd[1]:   " << c[2].ToStringF64(1) << '\n';
    cout << "divpd[1]:   " << c[3].ToStringF64(1) << '\n';
    cout << "abspd b[1]: " << c[4].ToStringF64(1) << '\n';
    cout << "sqrtpd a[1]:" << c[5].ToStringF64(1) << '\n';
    cout << "minpd[1]:   " << c[6].ToStringF64(1) << '\n';
    cout << "maxpd[1]:   " << c[7].ToStringF64(1) << '\n';
}

int main()
{
    AvxPackedMathF32();
    AvxPackedMathF64();
    return 0;
}

;-------------------------------------------------
;               Ch09_01.asm
;-------------------------------------------------

; Mask values used to calculate floating-point absolute values
            .const
AbsMaskF32  dword 8 dup(7fffffffh)
AbsMaskF64  qword 4 dup(7fffffffffffffffh)

; extern "C" void AvxPackedMathF32_(const YmmVal& a, const YmmVal& b, YmmVal c[8]);

            .code
AvxPackedMathF32_ proc

; Load packed SP floating-point values
            vmovaps ymm0,ymmword ptr [rcx]      ;ymm0 = *a
            vmovaps ymm1,ymmword ptr [rdx]      ;ymm1 = *b

; Packed SP floating-point addition
            vaddps ymm2,ymm0,ymm1
            vmovaps ymmword ptr [r8],ymm2

; Packed SP floating-point subtraction
            vsubps ymm2,ymm0,ymm1
            vmovaps ymmword ptr [r8+32],ymm2

; Packed SP floating-point multiplication
            vmulps ymm2,ymm0,ymm1
            vmovaps ymmword ptr [r8+64],ymm2

; Packed SP floating-point division
            vdivps ymm2,ymm0,ymm1
            vmovaps ymmword ptr [r8+96],ymm2

; Packed SP floating-point absolute value (b)
            vandps ymm2,ymm1,ymmword ptr [AbsMaskF32]
            vmovaps ymmword ptr [r8+128],ymm2

; Packed SP floating-point square root (a)
            vsqrtps ymm2,ymm0
            vmovaps ymmword ptr [r8+160],ymm2

; Packed SP floating-point minimum
            vminps ymm2,ymm0,ymm1
            vmovaps ymmword ptr [r8+192],ymm2

; Packed SP floating-point maximum
            vmaxps ymm2,ymm0,ymm1
            vmovaps ymmword ptr [r8+224],ymm2

            vzeroupper
            ret
AvxPackedMathF32_ endp

; extern "C" void AvxPackedMathF64_(const YmmVal& a, const YmmVal& b, YmmVal c[8]);

AvxPackedMathF64_ proc

; Load packed DP floating-point values
            vmovapd ymm0,ymmword ptr [rcx]      ;ymm0 = *a
            vmovapd ymm1,ymmword ptr [rdx]      ;ymm1 = *b

; Packed DP floating-point addition
            vaddpd ymm2,ymm0,ymm1
            vmovapd ymmword ptr [r8],ymm2

; Packed DP floating-point subtraction
            vsubpd ymm2,ymm0,ymm1
            vmovapd ymmword ptr [r8+32],ymm2

; Packed DP floating-point multiplication
            vmulpd ymm2,ymm0,ymm1
            vmovapd ymmword ptr [r8+64],ymm2

; Packed DP floating-point division
            vdivpd ymm2,ymm0,ymm1
            vmovapd ymmword ptr [r8+96],ymm2

; Packed DP floating-point absolute value (b)
            vandpd ymm2,ymm1,ymmword ptr [AbsMaskF64]
            vmovapd ymmword ptr [r8+128],ymm2

; Packed DP floating-point square root (a)
            vsqrtpd ymm2,ymm0
            vmovapd ymmword ptr [r8+160],ymm2

; Packed DP floating-point minimum
            vminpd ymm2,ymm0,ymm1
            vmovapd ymmword ptr [r8+192],ymm2

; Packed DP floating-point maximum
            vmaxpd ymm2,ymm0,ymm1
            vmovapd ymmword ptr [r8+224],ymm2

            vzeroupper
            ret
AvxPackedMathF64_ endp
            end


Listing 9-1 begins with the declaration of a C++ structure named YmmVal that's declared in the header 
file YmmVal.h. This structure is similar to the XmmVal structure that you saw in Chapter 6. YmmVal contains 
a publicly-accessible anonymous union that facilitates packed operand data exchange between functions 
written in C++ and x86 assembly language. The members of this union correspond to the packed data types 
that can be used with a YMM register. The structure YmmVal also includes several formatting and display 
functions (the source code for these member functions is not shown). 

The C++ code for example Ch09_01 starts with declarations for the assembly language functions 
AvxPackedMathF32_ and AvxPackedMathF64_. These functions carry out various packed single-precision 
and double-precision floating-point arithmetic operations using the supplied YmmVal arguments. 

Following the assembly language function declarations is the function AvxPackedMathF32. This function 
starts by initializing YmmVal variables a and b. Note that the C++ specifier alignas(32) is used with each 
YmmVal declaration. This specifier instructs the C++ compiler to align each YmmVal variable on a 32-byte 
boundary. Following YmmVal variable initialization, AvxPackedMathF32 calls the assembly language function 
AvxPackedMathF32_ to perform the required arithmetic. The results are then streamed to cout. The function 
AvxPackedMathF64 is the double-precision floating-point counterpart of AvxPackedMathF32. 
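
The effect of the alignas(32) specifier can be checked at run time, as in the following sketch. The structure Packed256 is a hypothetical stand-in for YmmVal, and the helper names are assumptions made for this illustration. The check matters because an aligned move instruction such as vmovaps raises an exception if its 256-bit memory operand is not located on a 32-byte boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for YmmVal: 32 bytes of packed single-precision data
struct Packed256
{
    float Values[8];
};

// Returns true if the address p is a multiple of alignment
bool IsAligned(const void* p, size_t alignment)
{
    assert(alignment != 0);
    return (reinterpret_cast<uintptr_t>(p) % alignment) == 0;
}

// Returns true if a local variable declared with alignas(32) is
// located on a 32-byte boundary.
bool LocalIsAligned32()
{
    alignas(32) Packed256 a {};
    return IsAligned(&a, 32);
}
```

A check of this kind is typically placed in debug builds, just before pointers are handed to an assembly language function that uses aligned moves.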

Near the top of the assembly language code in Listing 9-1 is a .const section that defines packed 
constant values for calculating floating-point absolute values. The text dup is a MASM operator that allocates 
and optionally initializes multiple data values. In the current example, the statement AbsMaskF32 dword 
8 dup(7fffffffh) allocates storage space for eight doubleword values and each value is initialized to 
0x7fffffff. The following statement, AbsMaskF64 qword 4 dup(7fffffffffffffffh), allocates four 
quadwords of 0x7fffffffffffffff. Note that neither of these 256-bit wide operands is preceded by an 
align statement, which means that they may not be properly aligned in memory. The reason for this is that 
the MASM align directive does not support 32-byte alignment within a .const, .data, or .code section. 
Later in this chapter, you learn how to define a custom segment of constant values that supports 32-byte 
alignment. 
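
The mask value 0x7fffffff works because the sign of a single-precision floating-point value is held in bit 31; clearing that bit and leaving the other 31 bits unchanged yields the absolute value. The following sketch applies the same mask to one value in C++. The function name AbsViaMask is hypothetical, and std::memcpy is used to access the bit pattern without violating the language's aliasing rules.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

// Computes |x| by clearing the sign bit, which is what a bitwise AND
// with the mask 0x7fffffff does to each packed element.
float AbsViaMask(float x)
{
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0x7fffffffu;
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}
```

For example, AbsViaMask(-98.6f) returns 98.6f, and a value that is already positive is returned unchanged.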

Following the .const section, the first instruction of AvxPackedMathF32_, vmovaps ymm0,ymmword 
ptr [rcx], loads argument a (i.e., the eight floating-point values of YmmVal a) into register YMM0. The 
vmovaps instruction can be used here since YmmVal a was defined using the alignas(32) specifier in the C++ code. The 
operator ymmword ptr directs the assembler to treat the memory location pointed to by RCX as a 256-bit 
wide operand. Use of the ymmword ptr operator is optional in this instance and employed to improve code 
readability. The ensuing vmovaps ymm1,ymmword ptr [rdx] instruction loads b into register YMM1. The 
vaddps ymm2,ymm0,ymm1 instruction that follows sums the packed single-precision floating-point values in 
YMM0 and YMM1; it then saves the result to YMM2. The vmovaps ymmword ptr [r8],ymm2 instruction saves 
the packed sums to c[0]. 

The ensuing vsubps, vmulps, and vdivps instructions carry out packed single-precision floating- 
point subtraction, multiplication, and division. This is followed by a vandps ymm2,ymm1,ymmword ptr 
[AbsMaskF32] instruction that calculates packed absolute values using argument b. The remaining 
instructions in AvxPackedMathF32_ calculate packed single-precision floating-point square roots, 
minimums, and maximums. 
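
A scalar reference model is a convenient way to cross-check the results of a packed function. The following sketch performs the same eight operations one lane at a time. The function name PackedMathF32Ref and the plain array parameters are assumptions made for this illustration; they are not part of the book's source code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

const size_t c_NumLanes = 8;

// Hypothetical reference model: c[k][i] receives the result of
// operation k (add, sub, mul, div, abs, sqrt, min, max) for lane i.
void PackedMathF32Ref(const float a[8], const float b[8], float c[8][8])
{
    assert(a != nullptr && b != nullptr && c != nullptr);

    for (size_t i = 0; i < c_NumLanes; i++)
    {
        c[0][i] = a[i] + b[i];
        c[1][i] = a[i] - b[i];
        c[2][i] = a[i] * b[i];
        c[3][i] = a[i] / b[i];
        c[4][i] = std::fabs(b[i]);
        c[5][i] = std::sqrt(a[i]);
        c[6][i] = (a[i] < b[i]) ? a[i] : b[i];      // same selection rule as vminps
        c[7][i] = (a[i] > b[i]) ? a[i] : b[i];      // same selection rule as vmaxps
    }
}
```

The minimum and maximum are written as explicit comparisons rather than calls to std::fmin and std::fmax because the packed instructions return the second operand when the comparison is false, including when an operand is a NaN.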

Prior to its ret instruction, the function AvxPackedMathF32_ uses a vzeroupper instruction, which 
zeros the high-order 128 bits of each YMM register. As explained in Chapter 4, the vzeroupper instruction 
is needed here to avoid potential performance delays that can occur whenever the processor transitions 
from executing x86-AVX instructions that use 256-bit wide operands to executing x86-SSE instructions. Any 
assembly language function that uses one or more YMM registers and is callable from code that potentially 
uses x86-SSE instructions should always ensure that a vzeroupper instruction is executed before program 
control is transferred back to the calling function. You'll see additional examples of vzeroupper instruction 
use in this and subsequent chapters. 

The organization of function AvxPackedMathF64_ is similar to AvxPackedMathF32_. AvxPackedMathF64_ 
carries out its calculations using the double-precision versions of the same instructions that are used in 
AvxPackedMathF32_. Here is the output for source code example Ch09_01: 


Results for AvxPackedMathF32
a[0]:           36.000000    0.031250  |    2.000000   42.000000
b[0]:           -0.111111   64.000000  |   -0.062500    8.666667
addps[0]:       35.888889   64.031250  |    1.937500   50.666668
subps[0]:       36.111111  -63.968750  |    2.062500   33.333332
mulps[0]:       -4.000000    2.000000  |   -0.125000  364.000000
divps[0]:     -324.000031    0.000488  |  -32.000000    4.846154
absps b[0]:      0.111111   64.000000  |    0.062500    8.666667
sqrtps a[0]:     6.000000    0.176777  |    1.414214    6.480741
minps[0]:       -0.111111    0.031250  |   -0.062500    8.666667
maxps[0]:       36.000000   64.000000  |    2.000000   42.000000

a[1]:            7.000000   20.500000  |   36.125000    0.500000
b[1]:          -18.125000   56.000000  |   24.000000  -98.599998
addps[1]:      -11.125000   76.500000  |   60.125000  -98.099998
subps[1]:       25.125000  -35.500000  |   12.125000   99.099998
mulps[1]:     -126.875000 1148.000000  |  867.000000  -49.299999
divps[1]:       -0.386207    0.366071  |    1.505208   -0.005071
absps b[1]:     18.125000   56.000000  |   24.000000   98.599998
sqrtps a[1]:     2.645751    4.527693  |    6.010407    0.707107
minps[1]:      -18.125000   20.500000  |   24.000000  -98.599998
maxps[1]:        7.000000   56.000000  |   36.125000    0.500000

Results for AvxPackedMathF64
a[0]:            2.000000000000  |    4.000000000000
b[0]:            3.141592653590  |    2.718281828459
addpd[0]:        5.141592653590  |    6.718281828459
subpd[0]:       -1.141592653590  |    1.281718171541
mulpd[0]:        6.283185307180  |   10.873127313836
divpd[0]:        0.636619772368  |    1.471517764686
abspd b[0]:      3.141592653590  |    2.718281828459
sqrtpd a[0]:     1.414213562373  |    2.000000000000
minpd[0]:        2.000000000000  |    2.718281828459
maxpd[0]:        3.141592653590  |    4.000000000000

a[1]:            7.500000000000  |    3.000000000000
b[1]:           -9.125000000000  |   -3.141592653590
addpd[1]:       -1.625000000000  |   -0.141592653590
subpd[1]:       16.625000000000  |    6.141592653590
mulpd[1]:      -68.437500000000  |   -9.424777960769
divpd[1]:       -0.821917808219  |   -0.954929658551
abspd b[1]:      9.125000000000  |    3.141592653590
sqrtpd a[1]:     2.738612787526  |    1.732050807569
minpd[1]:       -9.125000000000  |   -3.141592653590
maxpd[1]:        7.500000000000  |    3.000000000000


Packed Floating-Point Arrays 


In previous chapters, you learned how to carry out integer and floating-point array calculations using the 
general-purpose and XMM register sets. In this section, you learn how to perform floating-point array 
operations using the YMM register set. 


Simple Calculations 


Listing 9-2 shows the source code for example Ch09_02. This example illustrates how to perform simple 
array calculations using 256-bit wide packed floating-point operands. It also demonstrates how to detect 
and exclude invalid array elements from packed calculations. Source code example Ch09_02 is an array 
implementation of example Ch05_02 from Chapter 5, which calculated sphere surface areas and volumes. 
In that example, the assembly language function CalcSphereAreaVolume computed the surface area and 
volume of a single sphere. In this example, the sphere radii are passed via an array to calculating functions 
coded using C++ and assembly language. To make the example a little more interesting, both the C++ and 
assembly language calculating functions test for radii less than zero. If an invalid radius is detected, the 
calculating functions set the corresponding elements in the surface area and volume arrays to QNaN. 
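
Using QNaN as an "invalid result" marker has a consequence for code that compares two result arrays: any ordered comparison that involves a NaN is false. The following sketch illustrates this with a tolerance test of the form fabs(x - y) > eps. The function names ExceedsTolerance and ResultsAgree are hypothetical, and ResultsAgree is offered only as one possible stricter check under the assumption that default (non-fast-math) floating-point semantics are in effect.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Returns true if x and y differ by more than eps. When either value
// is a QNaN, the comparison is unordered and the result is false.
bool ExceedsTolerance(float x, float y, float eps)
{
    return std::fabs(x - y) > eps;
}

// Stricter hypothetical check: two results agree only if both are NaN
// or both are ordinary numbers within eps of each other.
bool ResultsAgree(float x, float y, float eps)
{
    if (std::isnan(x) || std::isnan(y))
        return std::isnan(x) && std::isnan(y);

    return !ExceedsTolerance(x, y, eps);
}
```

With the simple tolerance test, a QNaN in one array and an ordinary number in the other would not be reported as a discrepancy, which is why the stricter check treats NaN values explicitly.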


Listing 9-2. Example Ch09_02 


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>
#include <limits>
#define _USE_MATH_DEFINES
#include <math.h>

using namespace std;

extern "C" void AvxCalcSphereAreaVolume_(float* sa, float* vol, const float* r, size_t n);

extern "C" float c_PI_F32 = (float)M_PI;
extern "C" float c_QNaN_F32 = numeric_limits<float>::quiet_NaN();

void Init(float* r, size_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {1, 100};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
        r[i] = (float)ui_dist(rng) / 10.0f;

    // Set invalid radii for test purposes
    if (n > 2)
    {
        r[2] = -r[2];
        r[n / 4] = -r[n / 4];
        r[n / 2] = -r[n / 2];
        r[n / 4 * 3] = -r[n / 4 * 3];
        r[n - 2] = -r[n - 2];
    }
}

void AvxCalcSphereAreaVolumeCpp(float* sa, float* vol, const float* r, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (r[i] < 0.0f)
            sa[i] = vol[i] = c_QNaN_F32;
        else
        {
            sa[i] = r[i] * r[i] * 4.0f * c_PI_F32;
            vol[i] = sa[i] * r[i] / 3.0f;
        }
    }
}

void AvxCalcSphereAreaVolume(void)
{
    const size_t n = 21;
    alignas(32) float r[n];
    alignas(32) float sa1[n];
    alignas(32) float vol1[n];
    alignas(32) float sa2[n];
    alignas(32) float vol2[n];

    Init(r, n, 93);

    AvxCalcSphereAreaVolumeCpp(sa1, vol1, r, n);
    AvxCalcSphereAreaVolume_(sa2, vol2, r, n);

    cout << "\nResults for AvxCalcSphereAreaVolume\n";
    cout << fixed;

    const float eps = 1.0e-6f;

    for (size_t i = 0; i < n; i++)
    {
        cout << setw(2) << i << ": ";
        cout << setprecision(2);
        cout << setw(5) << r[i] << " | ";
        cout << setprecision(6);
        cout << setw(12) << sa1[i] << "   ";
        cout << setw(12) << sa2[i] << " | ";
        cout << setw(12) << vol1[i] << "   ";
        cout << setw(12) << vol2[i];

        bool b0 = (fabs(sa1[i] - sa2[i]) > eps);
        bool b1 = (fabs(vol1[i] - vol2[i]) > eps);

        if (b0 || b1)
            cout << " Compare discrepancy";

        cout << '\n';
    }
}

int main()
{
    AvxCalcSphereAreaVolume();
    return 0;
}

        include <cmpequ.asmh>
        include <MacrosX86-64-AVX.asmh>

        .const
r4_3p0  real4 3.0
r4_4p0  real4 4.0

        extern c_PI_F32:real4
        extern c_QNaN_F32:real4

; extern "C" void AvxCalcSphereAreaVolume_(float* sa, float* vol, const float* r, size_t n);

        .code
AvxCalcSphereAreaVolume_ proc frame
        _CreateFrame CC_,0,64
        _SaveXmmRegs xmm6,xmm7,xmm8,xmm9
        _EndProlog

; Initialize
        vbroadcastss ymm0,real4 ptr [r4_4p0]        ;packed 4.0
        vbroadcastss ymm1,real4 ptr [c_PI_F32]      ;packed PI
        vmulps ymm6,ymm0,ymm1                       ;packed 4.0 * PI
        vbroadcastss ymm7,real4 ptr [r4_3p0]        ;packed 3.0
        vbroadcastss ymm8,real4 ptr [c_QNaN_F32]    ;packed QNaN
        vxorps ymm9,ymm9,ymm9                       ;packed 0.0

        xor eax,eax                                 ;common offset for arrays
        cmp r9,8
        jb FinalR                                   ;skip main loop if n < 8

; Calculate surface area and volume values using packed arithmetic
@@:     vmovdqa ymm0,ymmword ptr [r8+rax]           ;load next 8 radii
        vmulps ymm2,ymm6,ymm0                       ;4.0 * PI * r
        vmulps ymm3,ymm2,ymm0                       ;4.0 * PI * r * r

        vcmpps ymm1,ymm0,ymm9,CMP_LT                ;ymm1 = mask of radii < 0.0
        vandps ymm4,ymm1,ymm8                       ;set surface area to QNaN for radii < 0.0
        vandnps ymm5,ymm1,ymm3                      ;keep surface area for radii >= 0.0
        vorps ymm5,ymm4,ymm5                        ;final packed surface area
        vmovaps ymmword ptr[rcx+rax],ymm5           ;save packed surface area

        vmulps ymm2,ymm3,ymm0                       ;4.0 * PI * r * r * r
        vdivps ymm3,ymm2,ymm7                       ;4.0 * PI * r * r * r / 3.0
        vandps ymm4,ymm1,ymm8                       ;set volume to QNaN for radii < 0.0
        vandnps ymm5,ymm1,ymm3                      ;keep volume for radii >= 0.0
        vorps ymm5,ymm4,ymm5                        ;final packed volume
        vmovaps ymmword ptr[rdx+rax],ymm5           ;save packed volume

        add rax,32                                  ;rax = offset to next set of radii
        sub r9,8
        cmp r9,8
        jae @B                                      ;repeat until n < 8

; Perform final calculations using scalar arithmetic
FinalR: test r9,r9
        jz Done                                     ;skip loop if no more elements

@@:     vmovss xmm0,real4 ptr [r8+rax]
        vmulss xmm2,xmm6,xmm0                       ;4.0 * PI * r
        vmulss xmm3,xmm2,xmm0                       ;4.0 * PI * r * r

        vcmpss xmm1,xmm0,xmm9,CMP_LT
        vandps xmm4,xmm1,xmm8
        vandnps xmm5,xmm1,xmm3
        vorps xmm5,xmm4,xmm5
        vmovss real4 ptr[rcx+rax],xmm5              ;save surface area

        vmulss xmm2,xmm3,xmm0                       ;4.0 * PI * r * r * r
        vdivss xmm3,xmm2,xmm7                       ;4.0 * PI * r * r * r / 3.0
        vandps xmm4,xmm1,xmm8
        vandnps xmm5,xmm1,xmm3
        vorps xmm5,xmm4,xmm5
        vmovss real4 ptr[rdx+rax],xmm5              ;save volume

        add rax,4
        dec r9
        jnz @B                                      ;repeat until done

Done:   vzeroupper
        _RestoreXmmRegs xmm6,xmm7,xmm8,xmm9
        _DeleteFrame
        ret
AvxCalcSphereAreaVolume_ endp
        end


The C++ code in Listing 9-2 includes a function named AvxCalcSphereAreaVolumeCpp. This function
calculates sphere surface areas and volumes. The sphere radii are passed to AvxCalcSphereAreaVolumeCpp
via an array. Prior to calculating a surface area or volume, the sphere's radius (r[i]) is tested to verify that
it's not negative. If the radius is negative, the corresponding elements in the surface area and volume arrays
(sa[i] and vol[i]) are set to c_QNaN_F32. The remaining C++ code performs the necessary initializations,
exercises the C++ and assembly language calculating functions, and displays the results. Note that the
function AvxCalcSphereAreaVolume employs the alignas(32) specifier with each array declaration, since the
assembly language function accesses these arrays using aligned 256-bit wide loads and stores.
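
The effect of the alignas(32) specifier can be checked with a few lines of C++. The following is a minimal sketch, not part of the example's source code; the helper names IsAlignedTo and LocalArrayIsAligned32 are hypothetical and were chosen only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper for this sketch (not part of Listing 9-2).
// Returns true if ptr lies on the given power-of-two byte boundary.
inline bool IsAlignedTo(const void* ptr, std::size_t alignment)
{
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// Declares a local array the same way the example does and reports
// whether its first element is aligned on a 32-byte boundary.
inline bool LocalArrayIsAligned32()
{
    alignas(32) float r[21];
    return IsAlignedTo(r, 32);
}
```

Calling LocalArrayIsAligned32() should return true on a conforming compiler, whereas an address that is offset by four bytes from an aligned array fails the same test.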

The assembly language function AvxCalcSphereAreaVolume_ performs the same calculations as its C++
counterpart. Following its prolog, AvxCalcSphereAreaVolume_ uses a series of vbroadcastss instructions
to initialize packed versions of the required constants. Prior to the start of the processing loop, a cmp
r9,8 instruction checks the value of n. The reason for this check is that the processing loop carries out
eight surface area and volume calculations simultaneously using 256-bit wide operands. The jb FinalR
conditional jump instruction skips the processing loop if there are fewer than eight radii to process.
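
To see how the work divides between the packed loop and the scalar loop, consider the following C++ sketch. It is only an illustration of the arithmetic; the structure LoopSplit and the function SplitLoop are hypothetical names that do not appear in the example's source code.

```cpp
#include <cstddef>

// Hypothetical type for this sketch: how many iterations each loop executes.
struct LoopSplit
{
    std::size_t packed_iterations;   // iterations of the packed processing loop
    std::size_t scalar_iterations;   // leftover elements processed one at a time
};

// Splits n elements between a packed loop that consumes simd_width
// elements per iteration and a scalar loop that handles the remainder.
inline LoopSplit SplitLoop(std::size_t n, std::size_t simd_width)
{
    return { n / simd_width, n % simd_width };
}
```

For the test case used in this example (n = 21 and eight single-precision values per 256-bit operand), SplitLoop(21, 8) yields two packed iterations and five scalar iterations.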

Each processing loop iteration begins with a vmovdqa ymm0,ymmword ptr [r8+rax] instruction that
loads eight single-precision floating-point radii into register YMM0. The ensuing vmulps instructions
calculate the sphere surface areas. The next instruction, vcmpps ymm1,ymm0,ymm9,CMP_LT, tests each sphere
radius for a value less than 0.0 (register YMM9 contains packed 0.0). Recall that the vcmpps instruction
signifies its results by setting elements in the destination operand to either 0x00000000 (false compare
predicate) or 0xFFFFFFFF (true compare predicate). The vandps, vandnps, and vorps instructions that
follow set the surface area of each sphere whose radius is less than 0.0 to c_QNaN_F32. Figure 9-1 illustrates
this operation in greater detail. A vmovaps ymmword ptr[rcx+rax],ymm5 instruction saves the eight sphere
surface area values to the array sa.
Figure 9-1. Surface area QNaN assignment for spheres with radius less than 0.0
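
The compare-and-select sequence can also be expressed in C++ for a single 32-bit element. The sketch below is a scalar emulation written for illustration only; the function name SelectQNaNIfNegative is hypothetical, and std::numeric_limits<float>::quiet_NaN() stands in for the constant c_QNaN_F32.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Scalar emulation of one 32-bit element of the vcmpps, vandps, vandnps,
// and vorps sequence. Hypothetical helper, not part of the example code.
inline float SelectQNaNIfNegative(float radius, float value)
{
    const float qnan = std::numeric_limits<float>::quiet_NaN();

    // vcmpps with CMP_LT: all ones if radius < 0.0, otherwise all zeros
    const std::uint32_t mask = (radius < 0.0f) ? 0xFFFFFFFFu : 0x00000000u;

    // Reinterpret the floating-point values as raw bit patterns
    std::uint32_t qnan_bits, value_bits;
    std::memcpy(&qnan_bits, &qnan, sizeof(qnan_bits));
    std::memcpy(&value_bits, &value, sizeof(value_bits));

    const std::uint32_t keep_qnan = mask & qnan_bits;          // vandps
    const std::uint32_t keep_value = ~mask & value_bits;       // vandnps
    const std::uint32_t result_bits = keep_qnan | keep_value;  // vorps

    float result;
    std::memcpy(&result, &result_bits, sizeof(result));
    return result;
}
```

A negative radius selects the QNaN bit pattern, while any other radius passes the calculated value through unchanged. No conditional jump is needed for the selection itself; the conditional expression in the sketch merely models the mask that vcmpps generates.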


Following the calculation of the surface areas, the vmulps ymm2,ymm3,ymm0 and vdivps
ymm3,ymm2,ymm7 instructions compute the sphere volumes. The processing loop uses another vandps,
vandnps, and vorps instruction sequence to set the volume of any negative-radius sphere to c_QNaN_F32.
These values are then saved to the array vol. The processing loop repeats until there are fewer than eight
remaining radii.

The next block of code computes sphere surface areas and volumes for the remaining (1-7) radii. Note
that AvxCalcSphereAreaVolume_ carries out these calculations using scalar single-precision floating-point
arithmetic. The scalar processing loop performs the same arithmetic and Boolean operations as the packed
processing loop. Similar to the previous example, AvxCalcSphereAreaVolume_ uses a vzeroupper instruction
immediately after the scalar processing loop. This instruction is needed since AvxCalcSphereAreaVolume_
carried out its calculations using the YMM register set. When a vzeroupper instruction is required, it should
always be positioned before any function epilog macros (e.g., _RestoreXmmRegs and _DeleteFrame) and the
ret instruction. Here are the results for source code example Ch09_02:
Results for AvxCalcSphereAreaVolume


0: 3.80 | 181.458389 181.458389 229.847290 229.847290 
1: 10.00 | 1256.637085 1256.637085 4188.790527 4188.790527 
2: -6.10 | nan nan nan nan 
3: 3.70 | 172.033630 172.033630 212.174805 212.174805 
4: 9.60 | 1158.116821 1158.116821 3705.973877 3705.973877 
5: -6.60 | nan nan nan nan 
6: 2.60 | 84.948662 84.948654 73.622169 73.622162 Compare discrepancy 
7: 9.30 | 1086.865479 1086.865479 3369.283203 3369.283203 
8: 9.00 | 1017.876038 1017.876038 3053.628174 3053.628174
9: 5.80 | 422.732758 422.732758 817.283386 817.283386 
10: -2.90 | nan nan nan nan 
11: 8.10 | 824.479675 824.479675 2226.095215 2226.095215 
12: 3.00 | 113.097336 113.097336 113.097328 113.097328 
13: 8.00 | 804.247742 804.247742 2144.660645 2144.660645 
14: 1.40 | 24.630087 24.630085 11.494040 11.494039 Compare discrepancy 
15: -1.80 | nan nan nan nan 
16: 4.30 | 232.352219 232.352219 333.038177 333.038177 
17: 6.60 | 547.391113 547.391113 1204.260376 1204.260376 
18: 4.50 | 254.469009 254.469009 381.703522 381.703522 
19: -1.20 | nan nan nan nan 
20: 4.50 | 254.469009 254.469009 381.703522 381.703522 


The output for source code example Ch09_02 includes a couple of lines with the text "Compare
discrepancy." This text was generated by the compare code in AvxCalcSphereAreaVolume to exemplify the
non-associativity of floating-point arithmetic. In this example, the functions AvxCalcSphereAreaVolumeCpp
and AvxCalcSphereAreaVolume_ carried out their respective floating-point calculations using different
operand orderings. For each sphere surface area, the C++ code calculates sa[i] = r[i] * r[i] * 4.0
* c_PI_F32, while the assembly language code calculates sa[i] = 4.0 * c_PI_F32 * r[i] * r[i]. Tiny
numerical discrepancies like this are not unusual when comparing floating-point values that are calculated
using different operand orderings, irrespective of the programming language. This is something that you
should keep in mind if you're developing production code that includes multiple versions of the same
calculating function (e.g., one coded using C++ and an AVX/AVX2 accelerated version that's implemented
using x86 assembly language).
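
The following C++ sketch demonstrates the same phenomenon in isolation. Note that it uses addition instead of the multiplications in this example, since the outcome is easy to verify by hand; the function names SumLeftFirst and SumRightFirst are hypothetical and were chosen only for this illustration.

```cpp
// Evaluates (a + b) + c using single-precision arithmetic.
inline float SumLeftFirst(float a, float b, float c)
{
    const float t = a + b;
    return t + c;
}

// Evaluates a + (b + c) using single-precision arithmetic.
inline float SumRightFirst(float a, float b, float c)
{
    const float t = b + c;
    return a + t;
}
```

Using the values a = 1.0e20f, b = -1.0e20f, and c = 1.0f, the function SumLeftFirst returns 1.0 since a and b cancel exactly before c is added. The function SumRightFirst returns 0.0 because c is too small to change b when the intermediate sum is rounded to single precision. This assumes default IEEE 754 compliant compilation (i.e., no fast-math style options).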

Finally, you may have noticed that the function AvxCalcSphereAreaVolume handled invalid radii 
sans any x86 conditional jump instructions. Minimizing the number of conditional jump instructions in a 
function, especially data-dependent ones, often results in faster executing code. You'll learn more about 
jump instruction optimization techniques in Chapter 15. 


Column Means 


Listing 9-3 shows the source code for example Ch09_03. This example illustrates how to calculate the
arithmetic mean of each column in a two-dimensional array of double-precision floating-point values.


Listing 9-3. Example Ch09_03


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>
#include <memory>

using namespace std;

extern "C" size_t c_NumRowsMax = 1024 * 1024;
extern "C" size_t c_NumColsMax = 1024 * 1024;

extern "C" bool AvxCalcColumnMeans_(const double* x, size_t nrows, size_t ncols, double* col_means);

void Init(double* x, size_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {1, 2000};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
        x[i] = (double)ui_dist(rng) / 10.0;
}

bool AvxCalcColumnMeansCpp(const double* x, size_t nrows, size_t ncols, double* col_means)
{
    // Make sure nrows and ncols are valid
    if (nrows == 0 || nrows > c_NumRowsMax)
        return false;
    if (ncols == 0 || ncols > c_NumColsMax)
        return false;

    // Set initial column means to zero
    for (size_t i = 0; i < ncols; i++)
        col_means[i] = 0.0;

    // Calculate column means
    for (size_t i = 0; i < nrows; i++)
    {
        for (size_t j = 0; j < ncols; j++)
            col_means[j] += x[i * ncols + j];
    }

    for (size_t j = 0; j < ncols; j++)
        col_means[j] /= nrows;

    return true;
}

void AvxCalcColumnMeans(void)
{
    const size_t nrows = 20;
    const size_t ncols = 11;
    unique_ptr<double[]> x {new double[nrows * ncols]};
    unique_ptr<double[]> col_means1 {new double[ncols]};
    unique_ptr<double[]> col_means2 {new double[ncols]};

    Init(x.get(), nrows * ncols, 47);

    bool rc1 = AvxCalcColumnMeansCpp(x.get(), nrows, ncols, col_means1.get());
    bool rc2 = AvxCalcColumnMeans_(x.get(), nrows, ncols, col_means2.get());

    cout << "Results for AvxCalcColumnMeans\n";

    if (!rc1 || !rc2)
    {
        cout << "Invalid return code: ";
        cout << "rc1 = " << boolalpha << rc1 << ", ";
        cout << "rc2 = " << boolalpha << rc2 << '\n';
        return;
    }

    cout << "\nTest Matrix\n";
    cout << fixed << setprecision(1);

    for (size_t i = 0; i < nrows; i++)
    {
        cout << "row " << setw(2) << i;
        for (size_t j = 0; j < ncols; j++)
            cout << setw(7) << x[i * ncols + j];
        cout << '\n';
    }

    cout << "\nColumn Means\n";
    cout << setprecision(2);

    for (size_t j = 0; j < ncols; j++)
    {
        cout << "col_means1[" << setw(2) << j << "] =";
        cout << setw(10) << col_means1[j] << "   ";
        cout << "col_means2[" << setw(2) << j << "] =";
        cout << setw(10) << col_means2[j] << '\n';
    }
}

int main()
{
    AvxCalcColumnMeans();
    return 0;
}
;-------------------------------------------------
;               Ch09_03.asm
;-------------------------------------------------

; extern "C" bool AvxCalcColumnMeans_(const double* x, size_t nrows, size_t ncols, double* col_means)

        extern c_NumRowsMax:qword
        extern c_NumColsMax:qword

        .code
AvxCalcColumnMeans_ proc

; Validate nrows and ncols
        xor eax,eax                         ;error return code (also col_mean index)
        test rdx,rdx
        jz Done                             ;jump if nrows is zero
        cmp rdx,[c_NumRowsMax]
        ja Done                             ;jump if nrows is too large
        test r8,r8
        jz Done                             ;jump if ncols is zero
        cmp r8,[c_NumColsMax]
        ja Done                             ;jump if ncols is too large

; Initialize elements of col_means to zero
        vxorpd xmm0,xmm0,xmm0               ;xmm0[63:0] = 0.0
@@:     vmovsd real8 ptr[r9+rax*8],xmm0     ;col_means[i] = 0.0
        inc rax
        cmp rax,r8
        jb @B                               ;repeat until done

        vcvtsi2sd xmm2,xmm2,rdx             ;convert nrows for later use

; Compute the sum of each column in x
LP1:    mov r11,r9                          ;r11 = ptr to col_means
        xor r10,r10                         ;r10 = col_index
LP2:    mov rax,r10                         ;rax = col_index
        add rax,4
        cmp rax,r8                          ;4 or more columns remaining?
        ja @F                               ;jump if no (col_index + 4 > ncols)

; Update col_means using next four columns
        vmovupd ymm0,ymmword ptr [rcx]      ;load next 4 columns of current row
        vaddpd ymm1,ymm0,ymmword ptr [r11]  ;add to col_means
        vmovupd ymmword ptr [r11],ymm1      ;save updated col_means
        add r10,4                           ;col_index += 4
        add rcx,32                          ;update x ptr
        add r11,32                          ;update col_means ptr
        jmp NextColSet

@@:     sub rax,2
        cmp rax,r8                          ;2 or more columns remaining?
        ja @F                               ;jump if no (col_index + 2 > ncols)

; Update col_means using next two columns
        vmovupd xmm0,xmmword ptr [rcx]      ;load next 2 columns of current row
        vaddpd xmm1,xmm0,xmmword ptr [r11]  ;add to col_means
        vmovupd xmmword ptr [r11],xmm1      ;save updated col_means
        add r10,2                           ;col_index += 2
        add rcx,16                          ;update x ptr
        add r11,16                          ;update col_means ptr
        jmp NextColSet

; Update col_means using next column (or last column in the current row)
@@:     vmovsd xmm0,real8 ptr [rcx]         ;load x from last column
        vaddsd xmm1,xmm0,real8 ptr [r11]    ;add to col_means
        vmovsd real8 ptr [r11],xmm1         ;save updated col_means
        inc r10                             ;col_index += 1
        add rcx,8                           ;update x ptr

NextColSet:
        cmp r10,r8                          ;more columns in current row?
        jb LP2                              ;jump if yes
        dec rdx                             ;nrows -= 1
        jnz LP1                             ;jump if more rows

; Compute the final col_means
@@:     vmovsd xmm0,real8 ptr [r9]          ;xmm0 = col_means[i]
        vdivsd xmm1,xmm0,xmm2               ;compute final mean
        vmovsd real8 ptr [r9],xmm1          ;save col_mean[i]
        add r9,8                            ;update col_means ptr
        dec r8                              ;ncols -= 1
        jnz @B                              ;repeat until done

        mov eax,1                           ;set success return code

Done:   vzeroupper
        ret
AvxCalcColumnMeans_ endp
        end


Toward the top of the C++ code is a function named AvxCalcColumnMeansCpp. This function calculates
the column means of a two-dimensional array using a straightforward set of nested for loops and some
simple arithmetic. The function AvxCalcColumnMeans contains code that uses the C++ smart pointer class
unique_ptr<> to help manage its dynamically-allocated arrays. Note that storage space for the test array x
is allocated using the C++ new operator, which means that the array may not be aligned on a 16- or 32-byte
boundary. In this particular example, aligning the start of array x to a specific boundary would be of little
benefit since it's not possible to align the individual rows or columns of a standard C++ two-dimensional
array (recall that the elements of a two-dimensional C++ array are stored in a contiguous block of memory
using row-major ordering as described in Chapter 2).
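
A short calculation shows why the rows of the test array cannot all be aligned. The C++ sketch below computes the byte offset of each row's first element for a row-major array of double-precision values; the function names are hypothetical, and the code assumes that sizeof(double) equals eight, which is the case on x86-64 platforms.

```cpp
#include <cstddef>

// Byte offset of element x[row][0] from the start of a row-major
// array of doubles that contains ncols columns. Hypothetical helper.
inline std::size_t RowByteOffset(std::size_t row, std::size_t ncols)
{
    return row * ncols * sizeof(double);
}

// Returns true if the specified row starts on a 32-byte boundary,
// assuming that the array itself starts on a 32-byte boundary.
inline bool RowStartsOn32ByteBoundary(std::size_t row, std::size_t ncols)
{
    return RowByteOffset(row, ncols) % 32 == 0;
}
```

When ncols equals 11, row 1 starts at byte offset 88, which is not a multiple of 32. Only every fourth row (byte offsets 0, 352, 704, and so on) would begin on a 32-byte boundary.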

The function AvxCalcColumnMeans also uses class unique_ptr<> and the new operator for the
one-dimensional arrays col_means1 and col_means2. Using unique_ptr<> in this example simplifies the C++
code somewhat since its destructor automatically invokes the delete[] operator to release the storage
space that was allocated by the new operator. If you're interested in learning more about the smart pointer
class unique_ptr<>, Appendix A contains a list of C++ references that you can consult. The remaining code
in AvxCalcColumnMeans invokes the C++ and assembly language column-mean calculating functions and
streams the results to cout.
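
The following C++ sketch shows the unique_ptr<> array idiom on its own. The functions MakeZeroedArray and SumArray are hypothetical helpers written for this illustration; they are not part of the example's source code.

```cpp
#include <cstddef>
#include <memory>

// Allocates an array of n doubles whose elements are value-initialized to 0.0.
// The storage is released with delete[] when the unique_ptr is destroyed.
inline std::unique_ptr<double[]> MakeZeroedArray(std::size_t n)
{
    return std::unique_ptr<double[]>(new double[n]());
}

// Sums n elements. A raw pointer obtained via get() can be passed to any
// function that expects one, including an assembly language function.
inline double SumArray(const double* x, std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}
```

A caller can write auto p = MakeZeroedArray(4); followed by SumArray(p.get(), 4) without ever coding an explicit delete[] statement.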
Following argument validation, the assembly language function AvxCalcColumnMeans_ initializes each
element in col_means to 0.0. These elements will maintain the intermediate column sums. In order to
maximize throughput, the column summation code uses slightly different instruction sequences depending
on the current column and the total number of columns in the array. For example, assume that array x
contains seven columns. For each row, the elements of the first four columns in x can be added to col_means
using 256-bit wide packed addition; the elements of the next two columns can be added to col_means using
128-bit wide packed addition; and the final column element must be added to col_means using scalar
addition. Figure 9-2 illustrates this technique in greater detail.

Figure 9-2. Updating the col_means array using different operand sizes
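
The operand size selection logic can be modeled in C++ as shown in the following sketch. The function ColumnBlockWidths is a hypothetical helper that mirrors the comparisons performed near label LP2; it returns the number of columns processed by each instruction sequence for a single row.

```cpp
#include <cstddef>
#include <vector>

// Returns the number of columns handled by each step while processing
// one row that contains ncols columns. Hypothetical helper for illustration.
inline std::vector<std::size_t> ColumnBlockWidths(std::size_t ncols)
{
    std::vector<std::size_t> widths;
    std::size_t col_index = 0;

    while (col_index < ncols)
    {
        std::size_t width;

        if (col_index + 4 <= ncols)
            width = 4;              // 256-bit wide packed addition
        else if (col_index + 2 <= ncols)
            width = 2;              // 128-bit wide packed addition
        else
            width = 1;              // scalar addition

        widths.push_back(width);
        col_index += width;
    }

    return widths;
}
```

For an array with seven columns, the function returns the sequence 4, 2, 1. For the 11 columns used by the test array in this example, it returns 4, 4, 2, 1.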


The mov r11,r9 instruction next to the label LP1 is the starting point for adding elements in the current
row of x to col_means. This instruction initializes R11 to the first entry in col_means. The col_index counter in
register R10 is then set to zero. The instruction group near the label LP2 determines the number of columns
remaining to be processed in the current row. If four or more columns remain, the next four elements
from the current row are added to the column sums in col_means. A vmovupd ymm0,ymmword ptr [rcx]
instruction loads four double-precision floating-point values from x into YMM0 (a vmovapd instruction is
not used here since alignment of the elements is unknown). The ensuing vaddpd ymm1,ymm0,ymmword ptr
[r11] instruction sums the current array elements with the corresponding elements in col_means, and the
vmovupd ymmword ptr [r11],ymm1 instruction saves the updated results back to col_means. The function's
various pointers and counters are then updated in preparation for the next set of elements from the current
row of x.

The summation code repeats the steps described in the previous paragraph until the number of array
elements that remain in the current row is less than four. As soon as this condition is met, the elements in the
remaining columns (if any) must be processed using 128-bit wide or 64-bit wide operands. This is the reason
for the distinct blocks of code in AvxCalcColumnMeans_ that process four elements, two elements, or a single
element per row. Following computation of the column sums, each element in col_means is divided by nrows,
which yields the final column mean. Here are the results for source code example Ch09_03:


Results for AvxCalcColumnMeans 


Test Matrix 


row 0 125.6 59.9 100.0 170.5 140.1 197.2 73.7 15.2 92.4 155.3 159.2
row 1 77.6 105.4 45.0 176.8 65.9 12.3 189.1 102.0 56.2 112.8 17.2 
row 2 198.9 199.3 74.6 137.9 65.0 125.0 19.8 32.1 58.6 94.1 123.5 
row 3 1.7 29.1 99.1 200.0 109.0 123.7 130.0 125.3 146.2 90.6 52.2 
row 4 8.7 88.7 84.8 174.6 164.4 106.2 114.0 151.8 130.8 101.9 116.2 
row 5 42.7 130.5 180.4 199.4 196.6 99.7 163.6 34.2 5.5 146.1 108.5 
row 6 120.0 159.5 26.0 83.4 58.7 10.1 170.1 20.5 10.8 48.3 121.9 
row 7 148.9 148.4 142.0 106.6 198.4 60.3 72.1 137.8 74.5 75.7 44.8 
row 8 25.7 192.0 12.1 23.4 98.7 145.3 196.8 43.9 143.1 25.1 122.6 
row 9 5.4 134.7 165.1 61.8 46.7 183.3 173.7 146.9 76.5 186.2 24.9 
row 10 174.5 158.9 127.8 58.9 42.9 182.9 7.8 50.3 68.0 62.0 66.1 
row 11 47.3 166.2 8.2 74.2 98.5 12.4 179.0 100.2 29.7 167.4 155.2 
row 12 23.9 196.6 148.7 7.1 128.2 128.8 66.3 153.7 60.7 115.4 71.6 
row 13 103.4 184.3 161.5 57.9 199.2 79.3 28.1 73.1 12.5 71.3 100.4 
row 14 130.3 154.2 127.5 29.7 198.2 170.3 121.9 80.4 159.8 70.0 82.6 
row 15 26.7 45.6 67.7 109.7 5.1 96.2 188.7 100.7 48.3 164.2 75.4 
row 16 115.4 25.5 58.8 148.5 80.7 149.1 156.7 153.8 42.0 103.7 4.2 
row 17 67.9 161.5 16.9 102.1 71.3 3.9 104.7 97.2 181.8 182.0 155.1 
row 18 169.5 122.4 102.2 5.5 14.5 105.1 181.5 83.3 117.6 52.1 111.2
row 19 47.1 146.9 21.0 8.6 130.3 24.7 95.7 6.7 159.9 38.8 82.6 

Column Means

col_means1[ 0] =  83.06   col_means2[ 0] =  83.06
col_means1[ 1] = 130.48   col_means2[ 1] = 130.48
col_means1[ 2] =  88.47   col_means2[ 2] =  88.47
col_means1[ 3] =  96.68   col_means2[ 3] =  96.68
col_means1[ 4] = 105.92   col_means2[ 4] = 105.92
col_means1[ 5] = 100.79   col_means2[ 5] = 100.79
col_means1[ 6] = 121.66   col_means2[ 6] = 121.66
col_means1[ 7] =  85.46   col_means2[ 7] =  85.46
col_means1[ 8] =  83.75   col_means2[ 8] =  83.75
col_means1[ 9] = 103.15   col_means2[ 9] = 103.15
col_means1[10] =  89.77   col_means2[10] =  89.77


Correlation Coefficient


The next source code example illustrates how to calculate a correlation coefficient using packed
double-precision floating-point arithmetic. This example also demonstrates how to perform a few common auxiliary
operations with packed floating-point operands, including 128-bit wide extractions and horizontal addition.
Listing 9-4 shows the source code for example Ch09_04.


Listing 9-4. Example Ch09_04


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <string>
#include <random>
#include <cmath>
#include "AlignedMem.h"

using namespace std;

extern "C" bool AvxCalcCorrCoef_(const double* x, const double* y, size_t n, double sums[5], double epsilon, double* rho);

void Init(double* x, double* y, size_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {1, 999};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
    {
        x[i] = (double)ui_dist(rng);
        y[i] = x[i] + (ui_dist(rng) % 6000) - 3000;
    }
}

bool AvxCalcCorrCoefCpp(const double* x, const double* y, size_t n, double sums[5], double epsilon, double* rho)
{
    const size_t alignment = 32;

    // Make sure n is valid
    if (n == 0)
        return false;

    // Make sure x and y are properly aligned
    if (!AlignedMem::IsAligned(x, alignment))
        return false;
    if (!AlignedMem::IsAligned(y, alignment))
        return false;

    // Calculate and save sum variables
    double sum_x = 0, sum_y = 0, sum_xx = 0, sum_yy = 0, sum_xy = 0;

    for (size_t i = 0; i < n; i++)
    {
        sum_x += x[i];
        sum_y += y[i];
        sum_xx += x[i] * x[i];
        sum_yy += y[i] * y[i];
        sum_xy += x[i] * y[i];
    }

    sums[0] = sum_x;
    sums[1] = sum_y;
    sums[2] = sum_xx;
    sums[3] = sum_yy;
    sums[4] = sum_xy;

    // Calculate rho
    double rho_num = n * sum_xy - sum_x * sum_y;
    double rho_den = sqrt(n * sum_xx - sum_x * sum_x) * sqrt(n * sum_yy - sum_y * sum_y);

    if (rho_den >= epsilon)
    {
        *rho = rho_num / rho_den;
        return true;
    }
    else
    {
        *rho = 0;
        return false;
    }
}

int main()
{
    const size_t n = 103;
    const size_t alignment = 32;
    AlignedArray<double> x_aa(n, alignment);
    AlignedArray<double> y_aa(n, alignment);
    double sums1[5], sums2[5];
    double rho1, rho2;
    double epsilon = 1.0e-12;
    double* x = x_aa.Data();
    double* y = y_aa.Data();

    Init(x, y, n, 72);

    bool rc1 = AvxCalcCorrCoefCpp(x, y, n, sums1, epsilon, &rho1);
    bool rc2 = AvxCalcCorrCoef_(x, y, n, sums2, epsilon, &rho2);

    cout << "Results for AvxCalcCorrCoef\n\n";

    if (!rc1 || !rc2)
    {
        cout << "Invalid return code ";
        cout << "rc1 = " << boolalpha << rc1 << ", ";
        cout << "rc2 = " << boolalpha << rc2 << '\n';
        return 1;
    }

    int w = 14;
    string sep(w * 3, '-');

    cout << fixed << setprecision(8);
    cout << "Value    " << setw(w) << "C++" << " " << setw(w) << "x86-AVX" << '\n';
    cout << sep << '\n';
    cout << "rho:     " << setw(w) << rho1 << " " << setw(w) << rho2 << "\n\n";

    cout << setprecision(1);
    cout << "sum_x:   " << setw(w) << sums1[0] << " " << setw(w) << sums2[0] << '\n';
    cout << "sum_y:   " << setw(w) << sums1[1] << " " << setw(w) << sums2[1] << '\n';
    cout << "sum_xx:  " << setw(w) << sums1[2] << " " << setw(w) << sums2[2] << '\n';
    cout << "sum_yy:  " << setw(w) << sums1[3] << " " << setw(w) << sums2[3] << '\n';
    cout << "sum_xy:  " << setw(w) << sums1[4] << " " << setw(w) << sums2[4] << '\n';
    return 0;
}

        include <MacrosX86-64-AVX.asmh>

; extern "C" bool AvxCalcCorrCoef_(const double* x, const double* y, size_t n, double sums[5], double epsilon, double* rho)
;
; Returns       0 = error, 1 = success

        .code
AvxCalcCorrCoef_ proc frame
        _CreateFrame CC_,0,32
        _SaveXmmRegs xmm6,xmm7
        _EndProlog

; Validate arguments
        or r8,r8
        jz BadArg                           ;jump if n == 0
        test rcx,1fh
        jnz BadArg                          ;jump if x is not aligned
        test rdx,1fh
        jnz BadArg                          ;jump if y is not aligned



; Initialize sum variables to zero
        vxorpd ymm3,ymm3,ymm3               ;packed sum_x
        vxorpd ymm4,ymm4,ymm4               ;packed sum_y
        vxorpd ymm5,ymm5,ymm5               ;packed sum_xx
        vxorpd ymm6,ymm6,ymm6               ;packed sum_yy
        vxorpd ymm7,ymm7,ymm7               ;packed sum_xy
        mov r10,r8                          ;r10 = n

        cmp r8,4
        jb LP2                              ;jump if n >= 1 && n <= 3

; Calculate intermediate packed sum variables
LP1:    vmovapd ymm0,ymmword ptr [rcx]      ;ymm0 = packed x values
        vmovapd ymm1,ymmword ptr [rdx]      ;ymm1 = packed y values

        vaddpd ymm3,ymm3,ymm0               ;update packed sum_x
        vaddpd ymm4,ymm4,ymm1               ;update packed sum_y

        vmulpd ymm2,ymm0,ymm1               ;ymm2 = packed xy values
        vaddpd ymm7,ymm7,ymm2               ;update packed sum_xy

        vmulpd ymm0,ymm0,ymm0               ;ymm0 = packed xx values
        vmulpd ymm1,ymm1,ymm1               ;ymm1 = packed yy values
        vaddpd ymm5,ymm5,ymm0               ;update packed sum_xx
        vaddpd ymm6,ymm6,ymm1               ;update packed sum_yy

        add rcx,32                          ;update x ptr
        add rdx,32                          ;update y ptr
        sub r8,4                            ;n -= 4
        cmp r8,4                            ;is n >= 4?
        jae LP1                             ;jump if yes

        or r8,r8                            ;is n == 0?
        jz FSV                              ;jump if yes

; Update sum variables with final x & y values
LP2:    vmovsd xmm0,real8 ptr [rcx]         ;xmm0[63:0] = x[i], ymm0[255:64] = 0
        vmovsd xmm1,real8 ptr [rdx]         ;xmm1[63:0] = y[i], ymm1[255:64] = 0

        vaddpd ymm3,ymm3,ymm0               ;update packed sum_x
        vaddpd ymm4,ymm4,ymm1               ;update packed sum_y

        vmulpd ymm2,ymm0,ymm1               ;ymm2 = packed xy values
        vaddpd ymm7,ymm7,ymm2               ;update packed sum_xy

        vmulpd ymm0,ymm0,ymm0               ;ymm0 = packed xx values
        vmulpd ymm1,ymm1,ymm1               ;ymm1 = packed yy values
        vaddpd ymm5,ymm5,ymm0               ;update packed sum_xx
        vaddpd ymm6,ymm6,ymm1               ;update packed sum_yy

        add rcx,8                           ;update x ptr
        add rdx,8                           ;update y ptr
        sub r8,1                            ;n -= 1
        jnz LP2                             ;repeat until done

; Calculate final sum variables
FSV:    vextractf128 xmm0,ymm3,1
        vaddpd xmm1,xmm0,xmm3
        vhaddpd xmm3,xmm1,xmm1              ;xmm3[63:0] = sum_x

        vextractf128 xmm0,ymm4,1
        vaddpd xmm1,xmm0,xmm4
        vhaddpd xmm4,xmm1,xmm1              ;xmm4[63:0] = sum_y

        vextractf128 xmm0,ymm5,1
        vaddpd xmm1,xmm0,xmm5
        vhaddpd xmm5,xmm1,xmm1              ;xmm5[63:0] = sum_xx

        vextractf128 xmm0,ymm6,1
        vaddpd xmm1,xmm0,xmm6
        vhaddpd xmm6,xmm1,xmm1              ;xmm6[63:0] = sum_yy

        vextractf128 xmm0,ymm7,1
        vaddpd xmm1,xmm0,xmm7
        vhaddpd xmm7,xmm1,xmm1              ;xmm7[63:0] = sum_xy

; Save final sum variables
        vmovsd real8 ptr [r9],xmm3          ;save sum_x
        vmovsd real8 ptr [r9+8],xmm4        ;save sum_y
        vmovsd real8 ptr [r9+16],xmm5       ;save sum_xx
        vmovsd real8 ptr [r9+24],xmm6       ;save sum_yy
        vmovsd real8 ptr [r9+32],xmm7       ;save sum_xy

; Calculate rho numerator
; rho_num = n * sum_xy - sum_x * sum_y;
        vcvtsi2sd xmm2,xmm2,r10             ;xmm2 = n
        vmulsd xmm0,xmm2,xmm7               ;xmm0 = n * sum_xy
        vmulsd xmm1,xmm3,xmm4               ;xmm1 = sum_x * sum_y
        vsubsd xmm7,xmm0,xmm1               ;xmm7 = rho_num

; Calculate rho denominator
; t1 = sqrt(n * sum_xx - sum_x * sum_x)
; t2 = sqrt(n * sum_yy - sum_y * sum_y)
; rho_den = t1 * t2
        vmulsd xmm0,xmm2,xmm5               ;xmm0 = n * sum_xx
        vmulsd xmm3,xmm3,xmm3               ;xmm3 = sum_x * sum_x
        vsubsd xmm3,xmm0,xmm3               ;xmm3 = n * sum_xx - sum_x * sum_x
        vsqrtsd xmm3,xmm3,xmm3              ;xmm3 = t1

        vmulsd xmm0,xmm2,xmm6               ;xmm0 = n * sum_yy
        vmulsd xmm4,xmm4,xmm4               ;xmm4 = sum_y * sum_y
        vsubsd xmm4,xmm0,xmm4               ;xmm4 = n * sum_yy - sum_y * sum_y
        vsqrtsd xmm4,xmm4,xmm4              ;xmm4 = t2

        vmulsd xmm0,xmm3,xmm4               ;xmm0 = rho_den

; Calculate and save final rho
        xor eax,eax
        vcomisd xmm0,real8 ptr [rbp+CC_OffsetStackArgs]    ;rho_den < epsilon?
        setae al                            ;set return code
        jb BadRho                           ;jump if rho_den < epsilon
        vdivsd xmm1,xmm7,xmm0               ;xmm1 = rho

SavRho: mov rdx,[rbp+CC_OffsetStackArgs+8]  ;rdx = ptr to rho
        vmovsd real8 ptr [rdx],xmm1         ;save rho

Done:   vzeroupper
        _RestoreXmmRegs xmm6,xmm7
        _DeleteFrame
        ret

; Error handling code
BadRho: vxorpd xmm1,xmm1,xmm1               ;rho = 0
        jmp SavRho

BadArg: xor eax,eax                         ;eax = invalid arg ret code
        jmp Done

AvxCalcCorrCoef_ endp
        end


A correlation coefficient measures the strength of association between two variables. Correlation 
coefficients can range in value from -1.0 to +1.0, signifying a perfect negative or positive relationship between 
the two variables. Real-world correlation coefficients are rarely equal to these theoretical limits. A correlation 
coefficient of 0.0 indicates that the data variables are not associated. The C++ and assembly language code in 
this example calculate the well-known Pearson correlation coefficient using the following equation: 


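$$\rho = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{\sqrt{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}\;\sqrt{n\sum_i y_i^2 - \left(\sum_i y_i\right)^2}}$$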


In order to calculate a correlation coefficient using this formula, a function must compute the following 
five sum variables: 


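$$\begin{aligned}
sum\_x &= \sum_i x_i \\
sum\_y &= \sum_i y_i \\
sum\_xx &= \sum_i x_i^2 \\
sum\_yy &= \sum_i y_i^2 \\
sum\_xy &= \sum_i x_i y_i
\end{aligned}$$
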
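
A small worked example may help to confirm the relationship between the sum variables and the correlation coefficient. The following C++ sketch is a compact variant of the calculation that uses std::vector instead of raw arrays; the function name PearsonRho is hypothetical, and the alignment checks performed by the listing's functions are intentionally omitted.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Computes the Pearson correlation coefficient from the five sum variables.
// Returns false (and sets *rho to 0.0) if the inputs are unusable or the
// denominator is smaller than epsilon. Hypothetical helper for illustration.
inline bool PearsonRho(const std::vector<double>& x, const std::vector<double>& y,
                       double epsilon, double* rho)
{
    *rho = 0.0;
    const std::size_t n = x.size();

    if (n == 0 || y.size() != n)
        return false;

    double sum_x = 0, sum_y = 0, sum_xx = 0, sum_yy = 0, sum_xy = 0;

    for (std::size_t i = 0; i < n; i++)
    {
        sum_x += x[i];
        sum_y += y[i];
        sum_xx += x[i] * x[i];
        sum_yy += y[i] * y[i];
        sum_xy += x[i] * y[i];
    }

    const double rho_num = n * sum_xy - sum_x * sum_y;
    const double rho_den = std::sqrt(n * sum_xx - sum_x * sum_x) *
                           std::sqrt(n * sum_yy - sum_y * sum_y);

    if (rho_den >= epsilon)
    {
        *rho = rho_num / rho_den;
        return true;
    }

    return false;
}
```

For x = {1, 2, 3} and y = {2, 4, 6}, the sum variables are sum_x = 6, sum_y = 12, sum_xx = 14, sum_yy = 56, and sum_xy = 28. This yields rho_num = 12 and a rho_den of approximately 12, so rho is 1.0 to within rounding error, which is the expected result for a perfect positive relationship.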
The C++ function AvxCalcCorrCoefCpp shows how to calculate a correlation coefficient. This function
begins by checking the value of n to make sure it's greater than zero. It also validates the two data arrays x
and y for proper alignment. The aforementioned sum variables are then calculated using a simple for loop.
Following completion of the for loop, the function AvxCalcCorrCoefCpp saves the sum variables to the array
sums for comparison and display purposes. It then computes the intermediate values rho_num and rho_den.
Before computing the final correlation coefficient rho, rho_den is tested to confirm that it's greater than or
equal to epsilon.

Following its prolog, the assembly language function AvxCalcCorrCoef_ performs the same size and
alignment checks as its C++ counterpart. It then initializes packed versions of sum_x, sum_y, sum_xx, sum_yy,
and sum_xy to zero in registers YMM3-YMM7. During each iteration, the loop labeled LP1 processes four
elements from arrays x and y using packed double-precision floating-point arithmetic. This means that
registers YMM3-YMM7 maintain four distinct intermediate values for each sum variable. Execution of loop
LP1 continues until there are fewer than four elements remaining to process.
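
The notion of four distinct intermediate values per sum variable can be modeled in C++ using a four-element array. The sketch below is a scalar emulation of a single packed sum variable; the function name AccumulateFourLanes is hypothetical. It also models the handling of any leftover elements, which are added to element 0 only while the other three elements effectively have 0.0 added.

```cpp
#include <array>
#include <cstddef>

// Scalar emulation of one packed sum variable (e.g., packed sum_x).
// Hypothetical helper written for illustration only.
inline std::array<double, 4> AccumulateFourLanes(const double* x, std::size_t n)
{
    std::array<double, 4> lanes = { { 0.0, 0.0, 0.0, 0.0 } };
    std::size_t i = 0;

    // Main loop: each element of lanes accumulates every fourth value of x
    for (; i + 4 <= n; i += 4)
    {
        for (std::size_t k = 0; k < 4; k++)
            lanes[k] += x[i + k];
    }

    // Remaining 1-3 values: only element 0 changes
    for (; i < n; i++)
        lanes[0] += x[i];

    return lanes;
}
```

For the seven values 1.0 through 7.0, the main loop executes once and the final three values are added to element 0, which yields the intermediate values 19.0, 2.0, 3.0, and 4.0. These four values must still be summed to obtain the final result of 28.0.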

Following completion of loop LP1, the loop labeled LP2 processes the final (1-3) entries in arrays x
and y. The vmovsd xmm0,real8 ptr [rcx] and vmovsd xmm1,real8 ptr [rdx] instructions load x[i] and
y[i] into registers XMM0 and XMM1, respectively. Note that these vmovsd instructions also zero out bits
YMM0[255:64] and YMM1[255:64], which means that the same chain of vaddpd and vmulpd instructions
used in loop LP1 to update the intermediate sum variables can also be used in loop LP2 (the scalar
instructions vaddsd and vmulsd cannot be used here to update the sum variables without extra code since
these instructions set bits 255:128 of their destination operand register to zero). Following completion of
loop LP2, each packed sum variable is reduced to a single value using a vextractf128, vaddpd, and vhaddpd
instruction, as illustrated in Figure 9-3. The final sum values are then saved to the sums array.


Figure 9-3. Calculation of sum_x using vextractf128, vaddpd, and vhaddpd
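
The reduction sequence performs its additions in a specific order, which the following C++ sketch makes explicit. The function ReduceFourLanes is a hypothetical scalar model of the three-instruction sequence; element 0 of the array corresponds to bits 63:0 of the YMM register.

```cpp
#include <array>

// Scalar model of the vextractf128, vaddpd, and vhaddpd reduction.
// Hypothetical helper written for illustration only.
inline double ReduceFourLanes(const std::array<double, 4>& lanes)
{
    // vextractf128 + vaddpd: add the upper 128 bits to the lower 128 bits
    const double s0 = lanes[2] + lanes[0];
    const double s1 = lanes[3] + lanes[1];

    // vhaddpd: add the two remaining adjacent values
    return s0 + s1;
}
```

Given the intermediate values 19.0, 2.0, 3.0, and 4.0, the function returns 28.0. Since the additions are grouped differently than in a simple sequential loop, you should not be surprised if a packed summation occasionally differs from its scalar counterpart in the least significant digits.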

Function AvxCalcCorrCoef_ uses simple scalar arithmetic to compute the intermediate values rho_num
and rho_den. Like the corresponding C++ function, AvxCalcCorrCoef_ compares rho_den with epsilon to see
if it's too small (a value below epsilon is likely a rounding error and considered too close to zero to be valid).
If rho_den is valid, the correlation coefficient rho is calculated and saved. Here are the results for source code
example Ch09_04:


Results for AvxCalcCorrCoef 


Value       C++               x86-AVX
rho:        0.70128193        0.70128193
sum_x:      53081.0           53081.0
sum_y:      -199158.0         -199158.0
sum_xx:     35732585.0        35732585.0
sum_yy:     401708868.0       401708868.0
sum_xy:     -94360528.0       -94360528.0
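
The final scalar step, turning the five sums into rho, can be sketched as follows. This is a minimal illustration that assumes the standard Pearson formula for rho_num and rho_den; the function name CalcRho and its parameter list are hypothetical and are not taken from the book's listing.

```cpp
#include <cmath>
#include <cstddef>

// Computes rho from the five running sums. Returns false when rho_den is
// smaller than epsilon, in which case *rho is left unchanged.
bool CalcRho(std::size_t n, double sum_x, double sum_y, double sum_xx,
             double sum_yy, double sum_xy, double epsilon, double* rho)
{
    const double nd = static_cast<double>(n);
    const double rho_num = nd * sum_xy - sum_x * sum_y;
    const double rho_den = std::sqrt(nd * sum_xx - sum_x * sum_x) *
                           std::sqrt(nd * sum_yy - sum_y * sum_y);

    if (rho_den < epsilon)
        return false;                    // too close to zero to be valid

    *rho = rho_num / rho_den;
    return true;
}
```

For the perfectly correlated inputs x = {1, 2, 3} and y = {2, 4, 6}, the sums are 6, 12, 14, 56, and 28, and the function produces a rho value of 1.0 to within rounding error.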


Matrix Multiplication and Transposition 


In Chapter 6, you learned how to perform 4 x 4 matrix transposition and multiplication using
single-precision floating-point values (see source code examples Ch06_07 and Ch06_08). The source code
example in this section illustrates how to carry out these same matrix operations using double-precision
floating-point values. Listing 9-5 shows the source code for example Ch09_05. The fundamentals of matrix
transposition and multiplication are explained in Chapter 6. If your understanding of these mathematical
operations is lacking, you may want to review the relevant sections in Chapter 6 before proceeding.


Listing 9-5. Example Ch09_05


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include "Ch09_05.h"
#include "Matrix.h"

using namespace std;

void AvxMat4x4TransposeF64(Matrix<double>& m_src1)
{
    const size_t nr = 4;
    const size_t nc = 4;
    Matrix<double> m_des1(nr ,nc);
    Matrix<double> m_des2(nr ,nc);

    Matrix<double>::Transpose(m_des1, m_src1);
    AvxMat4x4TransposeF64_(m_des2.Data(), m_src1.Data());

    cout << fixed << setprecision(1);

    m_src1.SetOstream(12, " ");
    m_des1.SetOstream(12, " ");
    m_des2.SetOstream(12, " ");

    cout << "Results for AvxMat4x4TransposeF64\n";
    cout << "Matrix m_src1\n" << m_src1 << '\n';
    cout << "Matrix m_des1\n" << m_des1 << '\n';
    cout << "Matrix m_des2\n" << m_des2 << '\n';

    if (m_des1 != m_des2)
        cout << "\nMatrix compare failed - AvxMat4x4TransposeF64\n";
}

void AvxMat4x4MulF64(Matrix<double>& m_src1, Matrix<double>& m_src2)
{
    const size_t nr = 4;
    const size_t nc = 4;
    Matrix<double> m_des1(nr ,nc);
    Matrix<double> m_des2(nr ,nc);

    Matrix<double>::Mul(m_des1, m_src1, m_src2);
    AvxMat4x4MulF64_(m_des2.Data(), m_src1.Data(), m_src2.Data());

    cout << fixed << setprecision(1);

    m_src1.SetOstream(12, " ");
    m_src2.SetOstream(12, " ");
    m_des1.SetOstream(12, " ");
    m_des2.SetOstream(12, " ");

    cout << "\nResults for AvxMat4x4MulF64\n";
    cout << "Matrix m_src1\n" << m_src1 << '\n';
    cout << "Matrix m_src2\n" << m_src2 << '\n';
    cout << "Matrix m_des1\n" << m_des1 << '\n';
    cout << "Matrix m_des2\n" << m_des2 << '\n';

    if (m_des1 != m_des2)
        cout << "\nMatrix compare failed - AvxMat4x4MulF64\n";
}

int main()
{
    const size_t nr = 4;
    const size_t nc = 4;
    Matrix<double> m_src1(nr ,nc);
    Matrix<double> m_src2(nr ,nc);

    const double src1_row0[] = { 10, 11, 12, 13 };
    const double src1_row1[] = { 20, 21, 22, 23 };
    const double src1_row2[] = { 30, 31, 32, 33 };
    const double src1_row3[] = { 40, 41, 42, 43 };

    const double src2_row0[] = { 100, 101, 102, 103 };
    const double src2_row1[] = { 200, 201, 202, 203 };
    const double src2_row2[] = { 300, 301, 302, 303 };
    const double src2_row3[] = { 400, 401, 402, 403 };

    m_src1.SetRow(0, src1_row0);
    m_src1.SetRow(1, src1_row1);
    m_src1.SetRow(2, src1_row2);
    m_src1.SetRow(3, src1_row3);

    m_src2.SetRow(0, src2_row0);
    m_src2.SetRow(1, src2_row1);
    m_src2.SetRow(2, src2_row2);
    m_src2.SetRow(3, src2_row3);

    // Test functions
    AvxMat4x4TransposeF64(m_src1);
    AvxMat4x4MulF64(m_src1, m_src2);

    // Benchmark functions
    AvxMat4x4TransposeF64_BM();
    AvxMat4x4MulF64_BM();
    return 0;
}

        include <MacrosX86-64-AVX.asmh>

; _Mat4x4TransposeF64 macro
;
; Description:  This macro computes the transpose of a 4x4
;               double-precision floating-point matrix.
;
;   Input Matrix                Output Matrix
;   ymm0    a3 a2 a1 a0         ymm0    d0 c0 b0 a0
;   ymm1    b3 b2 b1 b0         ymm1    d1 c1 b1 a1
;   ymm2    c3 c2 c1 c0         ymm2    d2 c2 b2 a2
;   ymm3    d3 d2 d1 d0         ymm3    d3 c3 b3 a3

_Mat4x4TransposeF64 macro
        vunpcklpd ymm4,ymm0,ymm1            ;ymm4 = b2 a2 b0 a0
        vunpckhpd ymm5,ymm0,ymm1            ;ymm5 = b3 a3 b1 a1
        vunpcklpd ymm6,ymm2,ymm3            ;ymm6 = d2 c2 d0 c0
        vunpckhpd ymm7,ymm2,ymm3            ;ymm7 = d3 c3 d1 c1

        vperm2f128 ymm0,ymm4,ymm6,20h       ;ymm0 = d0 c0 b0 a0
        vperm2f128 ymm1,ymm5,ymm7,20h       ;ymm1 = d1 c1 b1 a1
        vperm2f128 ymm2,ymm4,ymm6,31h       ;ymm2 = d2 c2 b2 a2
        vperm2f128 ymm3,ymm5,ymm7,31h       ;ymm3 = d3 c3 b3 a3
        endm

; extern "C" void AvxMat4x4TransposeF64_(double* m_des, const double* m_src1)

        .code
AvxMat4x4TransposeF64_ proc frame
        _CreateFrame MT_,0,32
        _SaveXmmRegs xmm6,xmm7
        _EndProlog

; Transpose matrix m_src1
        vmovaps ymm0,[rdx]                  ;ymm0 = m_src1.row_0
        vmovaps ymm1,[rdx+32]               ;ymm1 = m_src1.row_1
        vmovaps ymm2,[rdx+64]               ;ymm2 = m_src1.row_2
        vmovaps ymm3,[rdx+96]               ;ymm3 = m_src1.row_3

        _Mat4x4TransposeF64

        vmovaps [rcx],ymm0                  ;save m_des.row_0
        vmovaps [rcx+32],ymm1               ;save m_des.row_1
        vmovaps [rcx+64],ymm2               ;save m_des.row_2
        vmovaps [rcx+96],ymm3               ;save m_des.row_3

        vzeroupper
Done:   _RestoreXmmRegs xmm6,xmm7
        _DeleteFrame
        ret
AvxMat4x4TransposeF64_ endp

; _Mat4x4MulCalcRowF64 macro
;
; Description:  This macro computes one row of a 4x4 matrix multiplication.
;
; Registers:    ymm0 = m_src2.row0
;               ymm1 = m_src2.row1
;               ymm2 = m_src2.row2
;               ymm3 = m_src2.row3
;               rcx = m_des ptr
;               rdx = m_src1 ptr
;               ymm4 - ymm7 = scratch registers

_Mat4x4MulCalcRowF64 macro disp
        vbroadcastsd ymm4,real8 ptr [rdx+disp]      ;broadcast m_src1[i][0]
        vbroadcastsd ymm5,real8 ptr [rdx+disp+8]    ;broadcast m_src1[i][1]
        vbroadcastsd ymm6,real8 ptr [rdx+disp+16]   ;broadcast m_src1[i][2]
        vbroadcastsd ymm7,real8 ptr [rdx+disp+24]   ;broadcast m_src1[i][3]

        vmulpd ymm4,ymm4,ymm0                       ;m_src1[i][0] * m_src2.row_0
        vmulpd ymm5,ymm5,ymm1                       ;m_src1[i][1] * m_src2.row_1
        vmulpd ymm6,ymm6,ymm2                       ;m_src1[i][2] * m_src2.row_2
        vmulpd ymm7,ymm7,ymm3                       ;m_src1[i][3] * m_src2.row_3

        vaddpd ymm4,ymm4,ymm5                       ;calc m_des.row_i
        vaddpd ymm6,ymm6,ymm7
        vaddpd ymm4,ymm4,ymm6
        vmovapd [rcx+disp],ymm4                     ;save m_des.row_i
        endm

; extern "C" void AvxMat4x4MulF64_(double* m_des, const double* m_src1, const double* m_src2)

AvxMat4x4MulF64_ proc frame
        _CreateFrame MM_,0,32
        _SaveXmmRegs xmm6,xmm7
        _EndProlog

; Load m_src2 into YMM3:YMM0
        vmovapd ymm0,[r8]                   ;ymm0 = m_src2.row_0
        vmovapd ymm1,[r8+32]                ;ymm1 = m_src2.row_1
        vmovapd ymm2,[r8+64]                ;ymm2 = m_src2.row_2
        vmovapd ymm3,[r8+96]                ;ymm3 = m_src2.row_3

; Compute matrix product
        _Mat4x4MulCalcRowF64 0              ;calculate m_des.row_0
        _Mat4x4MulCalcRowF64 32             ;calculate m_des.row_1
        _Mat4x4MulCalcRowF64 64             ;calculate m_des.row_2
        _Mat4x4MulCalcRowF64 96             ;calculate m_des.row_3

        vzeroupper
Done:   _RestoreXmmRegs xmm6,xmm7
        _DeleteFrame
        ret
AvxMat4x4MulF64_ endp
        end


The C++ source code that's shown in Listing 9-5 is very similar to what you saw in Chapter 6. It begins
with a function named AvxMat4x4TransposeF64 that exercises both the C++ and assembly language matrix
transposition routines and displays the results. The function that follows, AvxMat4x4MulF64,
implements the same tasks for matrix multiplication. Similar to the source code examples in Chapter 6,
the C++ versions of matrix transposition and multiplication are implemented by the template functions
Matrix<>::Transpose and Matrix<>::Mul, respectively. Chapter 6 contains additional details regarding
these template functions.

Near the top of the assembly language code is a macro named _Mat4x4TransposeF64. This macro
contains instructions that transpose a 4 x 4 matrix of double-precision floating-point values. The four rows
of the source matrix must be loaded in registers YMM0-YMM3 prior to its use. Macro
_Mat4x4TransposeF64 uses the vperm2f128 instruction to permute the 128-bit wide floating-point
fields of its two source operands. This instruction uses an immediate 8-bit control mask to select which
fields are copied from the source operands to the destination operand, as outlined in Table 9-1. Figure 9-4
shows the entire 4 x 4 matrix transposition operation in greater detail. The assembly language function
AvxMat4x4TransposeF64_ uses the macro _Mat4x4TransposeF64 to transpose a 4 x 4 matrix of
double-precision floating-point values.


Table 9-1. Field Selection for vperm2f128 ymm0,ymm1,ymm2,imm8 Instruction

Destination Field    Source Field      imm8[1:0]    imm8[5:4]
ymm0[127:0]          ymm1[127:0]       0
                     ymm1[255:128]     1
                     ymm2[127:0]       2
                     ymm2[255:128]     3
ymm0[255:128]        ymm1[127:0]                    0
                     ymm1[255:128]                  1
                     ymm2[127:0]                    2
                     ymm2[255:128]                  3

[Figure 9-4: registers YMM0-YMM3 initially hold the matrix rows (10 11 12 13), (20 21 22 23),
(30 31 32 33), and (40 41 42 43). The figure shows the register contents after each of the following
instructions; on completion YMM0-YMM3 hold the transposed rows (10 20 30 40), (11 21 31 41),
(12 22 32 42), and (13 23 33 43).]

    vunpcklpd ymm4,ymm0,ymm1
    vunpckhpd ymm5,ymm0,ymm1
    vunpcklpd ymm6,ymm2,ymm3
    vunpckhpd ymm7,ymm2,ymm3
    vperm2f128 ymm0,ymm4,ymm6,20h
    vperm2f128 ymm1,ymm5,ymm7,20h
    vperm2f128 ymm2,ymm4,ymm6,31h
    vperm2f128 ymm3,ymm5,ymm7,31h

Figure 9-4. Instruction sequence used by _Mat4x4TransposeF64 to transpose a 4 x 4 matrix of
double-precision floating-point values
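
The unpack-then-permute sequence can also be traced with a small scalar model. The following C++ sketch is an illustration only: Vec4d, UnpackLo, UnpackHi, Perm2f128, and Transpose4x4 are hypothetical names that imitate vunpcklpd, vunpckhpd, and vperm2f128 on ordinary arrays, and the zeroing bits of the vperm2f128 control mask (imm8[3] and imm8[7]) are deliberately left out of the model.

```cpp
#include <array>

using Vec4d = std::array<double, 4>;   // models a YMM register; element 0 = bits 63:0

// vunpcklpd: interleaves the low element of each 128-bit lane of a and b.
Vec4d UnpackLo(const Vec4d& a, const Vec4d& b) { return { a[0], b[0], a[2], b[2] }; }

// vunpckhpd: interleaves the high element of each 128-bit lane of a and b.
Vec4d UnpackHi(const Vec4d& a, const Vec4d& b) { return { a[1], b[1], a[3], b[3] }; }

// vperm2f128: imm8[1:0] selects the low 128 bits of the result and imm8[5:4]
// the high 128 bits (0 = a.low, 1 = a.high, 2 = b.low, 3 = b.high).
Vec4d Perm2f128(const Vec4d& a, const Vec4d& b, unsigned imm8)
{
    auto half = [&](unsigned sel, unsigned i) -> double {
        const Vec4d& src = (sel & 2) ? b : a;
        return src[(sel & 1) * 2 + i];
    };
    const unsigned lo = imm8 & 3;
    const unsigned hi = (imm8 >> 4) & 3;
    return { half(lo, 0), half(lo, 1), half(hi, 0), half(hi, 1) };
}

// Transposes the four rows in place using the same eight steps as the macro.
void Transpose4x4(std::array<Vec4d, 4>& r)
{
    const Vec4d t4 = UnpackLo(r[0], r[1]);   // a0 b0 a2 b2
    const Vec4d t5 = UnpackHi(r[0], r[1]);   // a1 b1 a3 b3
    const Vec4d t6 = UnpackLo(r[2], r[3]);   // c0 d0 c2 d2
    const Vec4d t7 = UnpackHi(r[2], r[3]);   // c1 d1 c3 d3

    r[0] = Perm2f128(t4, t6, 0x20);          // a0 b0 c0 d0
    r[1] = Perm2f128(t5, t7, 0x20);          // a1 b1 c1 d1
    r[2] = Perm2f128(t4, t6, 0x31);          // a2 b2 c2 d2
    r[3] = Perm2f128(t5, t7, 0x31);          // a3 b3 c3 d3
}
```

Note that the comments in this sketch list elements from element 0 upward, which is the reverse of the high-to-low order used in the assembly language comments.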


In Listing 9-5, the macro definition _Mat4x4MulCalcRowF64 follows the function
AvxMat4x4TransposeF64_. This macro contains instructions that calculate a single row of a 4 x 4 matrix
multiplication. The row-multiplication technique that's used here is identical to the one that was used in
source code example Ch06_08 in Chapter 6 (see Figure 6-7). The function AvxMat4x4MulF64_ uses the macro
_Mat4x4MulCalcRowF64 to multiply two 4 x 4 double-precision floating-point matrices. Here are the results
for source code example Ch09_05:

Results for AvxMat4x4TransposeF64
Matrix m_src1
        10.0        11.0        12.0        13.0
        20.0        21.0        22.0        23.0
        30.0        31.0        32.0        33.0
        40.0        41.0        42.0        43.0
Matrix m_des1
        10.0        20.0        30.0        40.0
        11.0        21.0        31.0        41.0
        12.0        22.0        32.0        42.0
        13.0        23.0        33.0        43.0
Matrix m_des2
        10.0        20.0        30.0        40.0
        11.0        21.0        31.0        41.0
        12.0        22.0        32.0        42.0
        13.0        23.0        33.0        43.0

Results for AvxMat4x4MulF64
Matrix m_src1
        10.0        11.0        12.0        13.0
        20.0        21.0        22.0        23.0
        30.0        31.0        32.0        33.0
        40.0        41.0        42.0        43.0
Matrix m_src2
       100.0       101.0       102.0       103.0
       200.0       201.0       202.0       203.0
       300.0       301.0       302.0       303.0
       400.0       401.0       402.0       403.0
Matrix m_des1
     12000.0     12046.0     12092.0     12138.0
     22000.0     22086.0     22172.0     22258.0
     32000.0     32126.0     32252.0     32378.0
     42000.0     42166.0     42332.0     42498.0
Matrix m_des2
     12000.0     12046.0     12092.0     12138.0
     22000.0     22086.0     22172.0     22258.0
     32000.0     32126.0     32252.0     32378.0
     42000.0     42166.0     42332.0     42498.0

Running benchmark function AvxMat4x4TransposeF64_BM - please wait
Benchmark times save to file Ch09_05_AvxMat4x4TransposeF64_BM_CHROMIUM.csv

Running benchmark function AvxMat4x4MulF64_BM - please wait
Benchmark times save to file Ch09_05_AvxMat4x4MulF64_BM_CHROMIUM.csv

Tables 9-2 and 9-3 contain benchmark timing measurements for the matrix transposition and
multiplication functions presented in this section. These measurements were made using the procedure
that's described in Chapter 6.


Table 9-2. Matrix Transposition Mean Execution Times (Microseconds), 1,000,000 Transpositions 


CPU C++ Assembly Language 
i7-4790S 15562 2670 
i9-7900X 13167 2112 
i7-8700K 12194 1963 


Table 9-3. Matrix Multiplication Mean Execution Times (Microseconds), 1,000,000 Multiplications 


CPU C++ Assembly Language 
i7-4790S 55652 5874 
i9-7900X 46910 5286 
i7-8700K 43118 4505 
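
The row-oriented multiplication that _Mat4x4MulCalcRowF64 performs can be summarized with a short scalar sketch. It is not the book's Matrix<>::Mul implementation; the names Vec4d, Mat4x4, and Mul4x4ByRows are hypothetical, and the inner loop over j stands in for one packed vmulpd/vaddpd step. The assembly language macro adds the four products pairwise rather than sequentially, so results for inputs that are not exactly representable could differ in the last bit.

```cpp
#include <array>

using Vec4d = std::array<double, 4>;
using Mat4x4 = std::array<Vec4d, 4>;

// Computes des.row[i] as the sum over k of src1[i][k] * src2.row[k].
Mat4x4 Mul4x4ByRows(const Mat4x4& src1, const Mat4x4& src2)
{
    Mat4x4 des{};                               // all elements start at zero

    for (int i = 0; i < 4; i++)
    {
        for (int k = 0; k < 4; k++)
        {
            const double s = src1[i][k];        // value that vbroadcastsd replicates

            for (int j = 0; j < 4; j++)
                des[i][j] += s * src2[k][j];    // lane j of the multiply and add
        }
    }

    return des;
}
```

With the m_src1 and m_src2 test matrices from Listing 9-5, the first row of the product is 12000, 12046, 12092, 12138, which matches the program output shown earlier.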


Matrix Inversion 


Besides transposition and multiplication, matrix inversion is another common operation that’s often applied 
to 4 x 4 matrices. In this section, you examine a program that calculates the inverse of a 4 x 4 matrix of 
double-precision floating-point values. Listing 9-6 shows the source code for example Ch09_06. 


Listing 9-6. Example Ch09_06 


#include "stdafx.h"
#include <iostream>
#include <string>
#include <cmath>
#include "Ch09_06.h"
#include "Matrix.h"

using namespace std;

bool Avx2Mat4x4InvF64Cpp(Matrix<double>& m_inv, const Matrix<double>& m, double epsilon,
    bool* is_singular)
{
    // The intermediate matrices below are declared static for benchmarking purposes.
    static const size_t nrows = 4;
    static const size_t ncols = 4;
    static Matrix<double> m2(nrows, ncols);
    static Matrix<double> m3(nrows, ncols);
    static Matrix<double> m4(nrows, ncols);
    static Matrix<double> I(nrows, ncols, true);
    static Matrix<double> tempA(nrows, ncols);
    static Matrix<double> tempB(nrows, ncols);
    static Matrix<double> tempC(nrows, ncols);
    static Matrix<double> tempD(nrows, ncols);

    Matrix<double>::Mul(m2, m, m);
    Matrix<double>::Mul(m3, m2, m);
    Matrix<double>::Mul(m4, m3, m);

    double t1 = m.Trace();
    double t2 = m2.Trace();
    double t3 = m3.Trace();
    double t4 = m4.Trace();

    double c1 = -t1;
    double c2 = -1.0 / 2.0 * (c1 * t1 + t2);
    double c3 = -1.0 / 3.0 * (c2 * t1 + c1 * t2 + t3);
    double c4 = -1.0 / 4.0 * (c3 * t1 + c2 * t2 + c1 * t3 + t4);

    // Make sure matrix is not singular
    *is_singular = (fabs(c4) < epsilon);

    if (*is_singular)
        return false;

    // Calculate m_inv = -1.0 / c4 * (m3 + c1 * m2 + c2 * m + c3 * I)
    Matrix<double>::MulScalar(tempA, I, c3);
    Matrix<double>::MulScalar(tempB, m, c2);
    Matrix<double>::MulScalar(tempC, m2, c1);
    Matrix<double>::Add(tempD, tempA, tempB);
    Matrix<double>::Add(tempD, tempD, tempC);
    Matrix<double>::Add(tempD, tempD, m3);
    Matrix<double>::MulScalar(m_inv, tempD, -1.0 / c4);

    return true;
}



void Avx2Mat4x4InvF64(const Matrix<double>& m, const char* msg)
{
    cout << '\n' << msg << " - Test Matrix\n";
    cout << m << '\n';

    const double epsilon = 1.0e-9;
    const size_t nrows = m.GetNumRows();
    const size_t ncols = m.GetNumCols();
    Matrix<double> m_inv_a(nrows, ncols);
    Matrix<double> m_ver_a(nrows, ncols);
    Matrix<double> m_inv_b(nrows, ncols);
    Matrix<double> m_ver_b(nrows, ncols);

    for (int i = 0; i <= 1; i++)
    {
        string fn;
        const size_t nrows = m.GetNumRows();
        const size_t ncols = m.GetNumCols();
        Matrix<double> m_inv(nrows, ncols);
        Matrix<double> m_ver(nrows, ncols);
        bool rc, is_singular;

        if (i == 0)
        {
            fn = "Avx2Mat4x4InvF64Cpp";
            rc = Avx2Mat4x4InvF64Cpp(m_inv, m, epsilon, &is_singular);

            if (rc)
                Matrix<double>::Mul(m_ver, m_inv, m);
        }
        else
        {
            fn = "Avx2Mat4x4InvF64_";
            rc = Avx2Mat4x4InvF64_(m_inv.Data(), m.Data(), epsilon, &is_singular);

            if (rc)
                Avx2Mat4x4MulF64_(m_ver.Data(), m_inv.Data(), m.Data());
        }

        if (rc)
        {
            cout << msg << " - " << fn << " - Inverse Matrix\n";
            cout << m_inv << '\n';

            // Round to zero used for display purposes, can be removed.
            cout << msg << " - " << fn << " - Verify Matrix\n";
            m_ver.RoundToZero(epsilon);
            cout << m_ver << '\n';
        }
        else
        {
            if (is_singular)
                cout << msg << " - " << fn << " - Singular Matrix\n";
            else
                cout << msg << " - " << fn << " - Unexpected error occurred\n";
        }
    }
}

int main()
{
    cout << "\nResults for Avx2Mat4x4InvF64\n";

    // Test Matrix #1 - Non-Singular
    Matrix<double> m1(4, 4);
    const double m1_row0[] = { 2, 7, 3, 4 };
    const double m1_row1[] = { 5, 9, 6, 4.75 };
    const double m1_row2[] = { 6.5, 3, 4, 10 };
    const double m1_row3[] = { 7, 5.25, 8.125, 6 };
    m1.SetRow(0, m1_row0);
    m1.SetRow(1, m1_row1);
    m1.SetRow(2, m1_row2);
    m1.SetRow(3, m1_row3);

    // Test Matrix #2 - Non-Singular
    Matrix<double> m2(4, 4);
    const double m2_row0[] = { 0.5, 12, 17.25, 4 };
    const double m2_row1[] = { 5, 2, 6.75, 8 };
    const double m2_row2[] = { 13.125, 1, 3, 9.75 };
    const double m2_row3[] = { 16, 1.625, 7, 0.25 };
    m2.SetRow(0, m2_row0);
    m2.SetRow(1, m2_row1);
    m2.SetRow(2, m2_row2);
    m2.SetRow(3, m2_row3);

    // Test Matrix #3 - Singular
    Matrix<double> m3(4, 4);
    const double m3_row0[] = { 2, 0, 0, 1 };
    const double m3_row1[] = { 0, 4, 5, 0 };
    const double m3_row2[] = { 0, 0, 0, 7 };
    const double m3_row3[] = { 0, 0, 0, 6 };
    m3.SetRow(0, m3_row0);
    m3.SetRow(1, m3_row1);
    m3.SetRow(2, m3_row2);
    m3.SetRow(3, m3_row3);

    Avx2Mat4x4InvF64(m1, "Test #1");
    Avx2Mat4x4InvF64(m2, "Test #2");
    Avx2Mat4x4InvF64(m3, "Test #3");

    Avx2Mat4x4InvF64_BM(m1);
    return 0;
}

        include <MacrosX86-64-AVX.asmh>

; Custom segment for constants
ConstVals segment readonly align(32) 'const'
Mat4x4I         real8 1.0, 0.0, 0.0, 0.0
                real8 0.0, 1.0, 0.0, 0.0
                real8 0.0, 0.0, 1.0, 0.0
                real8 0.0, 0.0, 0.0, 1.0

r8_SignBitMask  qword 4 dup (8000000000000000h)
r8_AbsMask      qword 4 dup (7fffffffffffffffh)

r8_1p0          real8 1.0
r8_N1p0         real8 -1.0
r8_N0p5         real8 -0.5
r8_N0p3333      real8 -0.33333333333333
r8_N0p25        real8 -0.25
ConstVals ends
        .code

; _Max4x4TraceF64 macro
;
; Description:  This macro contains instructions that compute the trace
;               of the 4x4 double-precision floating-point matrix in ymm3:ymm0.

_Max4x4TraceF64 macro
        vblendpd ymm0,ymm0,ymm1,00000010b       ;ymm0[127:0] = row 1,0 diag vals
        vblendpd ymm1,ymm2,ymm3,00001000b       ;ymm1[255:128] = row 3,2 diag vals
        vperm2f128 ymm2,ymm1,ymm1,00000001b     ;ymm2[127:0] = row 3,2 diag vals
        vaddpd ymm3,ymm0,ymm2
        vhaddpd ymm0,ymm3,ymm3                  ;xmm0[63:0] = trace
        endm

; extern "C" double Avx2Mat4x4TraceF64_(const double* m_src1)
;
; Description:  The following function computes the trace of a
;               4x4 double-precision floating-point array.

Avx2Mat4x4TraceF64_ proc
        vmovapd ymm0,[rcx]                      ;ymm0 = m_src1.row_0
        vmovapd ymm1,[rcx+32]                   ;ymm1 = m_src1.row_1
        vmovapd ymm2,[rcx+64]                   ;ymm2 = m_src1.row_2
        vmovapd ymm3,[rcx+96]                   ;ymm3 = m_src1.row_3

        _Max4x4TraceF64                         ;xmm0[63:0] = m_src1.trace()

        vzeroupper
        ret
Avx2Mat4x4TraceF64_ endp

; _Mat4x4MulCalcRowF64 macro
;
; Description:  This macro is used to compute one row of a 4x4 matrix
;               multiply.
;
; Registers:    ymm0 = m_src2.row0
;               ymm1 = m_src2.row1
;               ymm2 = m_src2.row2
;               ymm3 = m_src2.row3
;               ymm4 - ymm7 = scratch registers

_Mat4x4MulCalcRowF64 macro dreg,sreg,disp
        vbroadcastsd ymm4,real8 ptr [sreg+disp]     ;broadcast m_src1[i][0]
        vbroadcastsd ymm5,real8 ptr [sreg+disp+8]   ;broadcast m_src1[i][1]
        vbroadcastsd ymm6,real8 ptr [sreg+disp+16]  ;broadcast m_src1[i][2]
        vbroadcastsd ymm7,real8 ptr [sreg+disp+24]  ;broadcast m_src1[i][3]

        vmulpd ymm4,ymm4,ymm0                       ;m_src1[i][0] * m_src2.row_0
        vmulpd ymm5,ymm5,ymm1                       ;m_src1[i][1] * m_src2.row_1
        vmulpd ymm6,ymm6,ymm2                       ;m_src1[i][2] * m_src2.row_2
        vmulpd ymm7,ymm7,ymm3                       ;m_src1[i][3] * m_src2.row_3

        vaddpd ymm4,ymm4,ymm5                       ;calc m_des.row_i
        vaddpd ymm6,ymm6,ymm7
        vaddpd ymm4,ymm4,ymm6
        vmovapd [dreg+disp],ymm4                    ;save m_des.row_i
        endm

; extern "C" void Avx2Mat4x4MulF64_(double* m_des, const double* m_src1, const double* m_src2)

Avx2Mat4x4MulF64_ proc frame
        _CreateFrame MM_,0,32
        _SaveXmmRegs xmm6,xmm7
        _EndProlog

        vmovapd ymm0,[r8]                       ;ymm0 = m_src2.row_0
        vmovapd ymm1,[r8+32]                    ;ymm1 = m_src2.row_1
        vmovapd ymm2,[r8+64]                    ;ymm2 = m_src2.row_2
        vmovapd ymm3,[r8+96]                    ;ymm3 = m_src2.row_3

        _Mat4x4MulCalcRowF64 rcx,rdx,0          ;calculate m_des.row_0
        _Mat4x4MulCalcRowF64 rcx,rdx,32         ;calculate m_des.row_1
        _Mat4x4MulCalcRowF64 rcx,rdx,64         ;calculate m_des.row_2
        _Mat4x4MulCalcRowF64 rcx,rdx,96         ;calculate m_des.row_3

        vzeroupper
        _RestoreXmmRegs xmm6,xmm7
        _DeleteFrame
        ret
Avx2Mat4x4MulF64_ endp

; extern "C" bool Avx2Mat4x4InvF64_(double* m_inv, const double* m, double epsilon, bool* is_singular);

; Offsets of intermediate matrices on stack relative to rsp
OffsetM2 equ 32
OffsetM3 equ 160
OffsetM4 equ 288

Avx2Mat4x4InvF64_ proc frame
        _CreateFrame MI_,0,160
        _SaveXmmRegs xmm6,xmm7,xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15
        _EndProlog

; Save args to home area for later use
        mov qword ptr [rbp+MI_OffsetHomeRCX],rcx        ;save m_inv ptr
        mov qword ptr [rbp+MI_OffsetHomeRDX],rdx        ;save m ptr
        vmovsd real8 ptr [rbp+MI_OffsetHomeR8],xmm2     ;save epsilon
        mov qword ptr [rbp+MI_OffsetHomeR9],r9          ;save is_singular ptr

; Allocate 384 bytes of stack space for temp matrices + 32 bytes for function calls
        and rsp,0ffffffe0h                  ;align rsp to 32-byte boundary
        sub rsp,416                         ;alloc stack space

; Calculate m2
        lea rcx,[rsp+OffsetM2]              ;rcx = m2 ptr
        mov r8,rdx                          ;rdx, r8 = m ptr
        call Avx2Mat4x4MulF64_              ;calculate and save m2

; Calculate m3
        lea rcx,[rsp+OffsetM3]              ;rcx = m3 ptr
        lea rdx,[rsp+OffsetM2]              ;rdx = m2 ptr
        mov r8,[rbp+MI_OffsetHomeRDX]       ;r8 = m
        call Avx2Mat4x4MulF64_              ;calculate and save m3

; Calculate m4
        lea rcx,[rsp+OffsetM4]              ;rcx = m4 ptr
        lea rdx,[rsp+OffsetM3]              ;rdx = m3 ptr
        mov r8,[rbp+MI_OffsetHomeRDX]       ;r8 = m
        call Avx2Mat4x4MulF64_              ;calculate and save m4

; Calculate trace of m, m2, m3, and m4
        mov rcx,[rbp+MI_OffsetHomeRDX]
        call Avx2Mat4x4TraceF64_
        vmovsd xmm8,xmm8,xmm0               ;xmm8 = t1

        lea rcx,[rsp+OffsetM2]
        call Avx2Mat4x4TraceF64_
        vmovsd xmm9,xmm9,xmm0               ;xmm9 = t2

        lea rcx,[rsp+OffsetM3]
        call Avx2Mat4x4TraceF64_
        vmovsd xmm10,xmm10,xmm0             ;xmm10 = t3

        lea rcx,[rsp+OffsetM4]
        call Avx2Mat4x4TraceF64_
        vmovsd xmm11,xmm11,xmm0             ;xmm11 = t4

; Calculate the required coefficients
; c1 = -t1;
; c2 = -1.0f / 2.0f * (c1 * t1 + t2);
; c3 = -1.0f / 3.0f * (c2 * t1 + c1 * t2 + t3);
; c4 = -1.0f / 4.0f * (c3 * t1 + c2 * t2 + c1 * t3 + t4);
;
; Registers used:
;   t1-t4 = xmm8-xmm11
;   c1-c4 = xmm12-xmm15

        vxorpd xmm12,xmm8,real8 ptr [r8_SignBitMask]    ;xmm12 = c1

        vmulsd xmm13,xmm12,xmm8             ;c1 * t1
        vaddsd xmm13,xmm13,xmm9             ;c1 * t1 + t2
        vmulsd xmm13,xmm13,[r8_N0p5]        ;c2

        vmulsd xmm14,xmm13,xmm8             ;c2 * t1
        vmulsd xmm0,xmm12,xmm9              ;c1 * t2
        vaddsd xmm14,xmm14,xmm0             ;c2 * t1 + c1 * t2
        vaddsd xmm14,xmm14,xmm10            ;c2 * t1 + c1 * t2 + t3
        vmulsd xmm14,xmm14,[r8_N0p3333]     ;c3

        vmulsd xmm15,xmm14,xmm8             ;c3 * t1
        vmulsd xmm0,xmm13,xmm9              ;c2 * t2
        vmulsd xmm1,xmm12,xmm10             ;c1 * t3
        vaddsd xmm2,xmm0,xmm1               ;c2 * t2 + c1 * t3
        vaddsd xmm15,xmm15,xmm2             ;c3 * t1 + c2 * t2 + c1 * t3
        vaddsd xmm15,xmm15,xmm11            ;c3 * t1 + c2 * t2 + c1 * t3 + t4
        vmulsd xmm15,xmm15,[r8_N0p25]       ;c4

; Make sure matrix is not singular
        vandpd xmm0,xmm15,[r8_AbsMask]                  ;compute fabs(c4)
        vmovsd xmm1,real8 ptr [rbp+MI_OffsetHomeR8]
        vcomisd xmm0,real8 ptr [rbp+MI_OffsetHomeR8]    ;compare against epsilon
        setp al                                         ;set al = 1 if unordered
        setb ah                                         ;set ah = 1 if fabs(c4) < epsilon
        or al,ah                                        ;al = is_singular
        mov rcx,[rbp+MI_OffsetHomeR9]                   ;rcx = is_singular ptr
        mov [rcx],al                                    ;save is_singular state
        jnz Error                                       ;jump if singular

; Calculate m_inv = -1.0 / c4 * (m3 + c1 * m2 + c2 * m1 + c3 * I)
        vbroadcastsd ymm14,xmm14                ;ymm14 = packed c3
        lea rcx,[Mat4x4I]                       ;rcx = I ptr
        vmulpd ymm0,ymm14,ymmword ptr [rcx]
        vmulpd ymm1,ymm14,ymmword ptr [rcx+32]
        vmulpd ymm2,ymm14,ymmword ptr [rcx+64]
        vmulpd ymm3,ymm14,ymmword ptr [rcx+96]  ;c3 * I

        vbroadcastsd ymm13,xmm13                ;ymm13 = packed c2
        mov rcx,[rbp+MI_OffsetHomeRDX]          ;rcx = m ptr
        vmulpd ymm4,ymm13,ymmword ptr [rcx]
        vmulpd ymm5,ymm13,ymmword ptr [rcx+32]
        vmulpd ymm6,ymm13,ymmword ptr [rcx+64]
        vmulpd ymm7,ymm13,ymmword ptr [rcx+96]  ;c2 * m1
        vaddpd ymm0,ymm0,ymm4
        vaddpd ymm1,ymm1,ymm5
        vaddpd ymm2,ymm2,ymm6
        vaddpd ymm3,ymm3,ymm7                   ;c2 * m1 + c3 * I

        vbroadcastsd ymm12,xmm12                ;ymm12 = packed c1
        lea rcx,[rsp+OffsetM2]                  ;rcx = m2 ptr
        vmulpd ymm4,ymm12,ymmword ptr [rcx]
        vmulpd ymm5,ymm12,ymmword ptr [rcx+32]
        vmulpd ymm6,ymm12,ymmword ptr [rcx+64]
        vmulpd ymm7,ymm12,ymmword ptr [rcx+96]  ;c1 * m2
        vaddpd ymm0,ymm0,ymm4
        vaddpd ymm1,ymm1,ymm5
        vaddpd ymm2,ymm2,ymm6
        vaddpd ymm3,ymm3,ymm7                   ;c1 * m2 + c2 * m1 + c3 * I

        lea rcx,[rsp+OffsetM3]                  ;rcx = m3 ptr
        vaddpd ymm0,ymm0,ymmword ptr [rcx]
        vaddpd ymm1,ymm1,ymmword ptr [rcx+32]
        vaddpd ymm2,ymm2,ymmword ptr [rcx+64]
        vaddpd ymm3,ymm3,ymmword ptr [rcx+96]   ;m3 + c1 * m2 + c2 * m1 + c3 * I

        vmovsd xmm4,[r8_N1p0]
        vdivsd xmm4,xmm4,xmm15                  ;xmm4 = -1.0 / c4
        vbroadcastsd ymm4,xmm4
        vmulpd ymm0,ymm0,ymm4
        vmulpd ymm1,ymm1,ymm4
        vmulpd ymm2,ymm2,ymm4
        vmulpd ymm3,ymm3,ymm4                   ;ymm3:ymm0 = m_inv

; Save m_inv
        mov rcx,[rbp+MI_OffsetHomeRCX]
        vmovapd ymmword ptr [rcx],ymm0
        vmovapd ymmword ptr [rcx+32],ymm1
        vmovapd ymmword ptr [rcx+64],ymm2
        vmovapd ymmword ptr [rcx+96],ymm3

        mov eax,1                               ;set success return code

Done:   vzeroupper
        _RestoreXmmRegs xmm6,xmm7,xmm8,xmm9,xmm10,xmm11,xmm12,xmm13,xmm14,xmm15
        _DeleteFrame
        ret

Error:  xor eax,eax
        jmp Done

Avx2Mat4x4InvF64_ endp
        end


The multiplicative inverse of a matrix is defined as follows: Let A and X represent n x n matrices. Matrix 
X is an inverse of A if AX = XA = I, where I denotes an n x n identity matrix (i.e., a matrix of all zeros except 
for the diagonal elements, which are equal to one). Figure 9-5 shows an example of an inverse matrix. It 
is important to note that inverses do not exist for all n x n matrices. A matrix without an inverse is called a 
singular matrix. 


    | 6   2   2 |       |  0.1875  -0.0625  -0.125 |                     | 1  0  0 |
A = | 2  -2   2 |   X = |  0.0625  -0.1875   0.125 |   AX = XA = I =     | 0  1  0 |
    | 0   4   2 |       | -0.125    0.375    0.25  |                     | 0  0  1 |


Figure 9-5. Matrix A and its multiplicative inverse Matrix X


The inverse of a 4 x 4 matrix can be calculated using a variety of mathematical techniques. Source code 
example Ch09 06 uses a computational method based on the Cayley-Hamilton theorem, which employs 
common matrix operations that are relatively easy to carry out using SIMD arithmetic. Here are the required 
equations: 


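$A^1 = A;\quad A^2 = AA;\quad A^3 = AAA;\quad A^4 = AAAA$

$\mathrm{trace}(A) = \sum_{i} a_{ii}$

$t_n = \mathrm{trace}(A^n)$

$c_1 = -t_1$

$c_2 = -\frac{1}{2}(c_1 t_1 + t_2)$

$c_3 = -\frac{1}{3}(c_2 t_1 + c_1 t_2 + t_3)$

$c_4 = -\frac{1}{4}(c_3 t_1 + c_2 t_2 + c_1 t_3 + t_4)$

$A^{-1} = -\frac{1}{c_4}(A^3 + c_1 A^2 + c_2 A + c_3 I)$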
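
The equations above translate almost line for line into code. The following self-contained C++ sketch uses plain arrays instead of the book's Matrix<> class; the names Mat4, Mul, Trace, and InverseCayleyHamilton are hypothetical and chosen only for this illustration.

```cpp
#include <array>
#include <cmath>

using Mat4 = std::array<std::array<double, 4>, 4>;

// Returns the matrix product a * b.
Mat4 Mul(const Mat4& a, const Mat4& b)
{
    Mat4 c{};
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 4; k++)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Returns the sum of the diagonal elements.
double Trace(const Mat4& a) { return a[0][0] + a[1][1] + a[2][2] + a[3][3]; }

// Computes m_inv using the coefficient equations shown above.
// Returns false (leaving m_inv untouched) when |c4| < epsilon.
bool InverseCayleyHamilton(const Mat4& m, double epsilon, Mat4& m_inv)
{
    const Mat4 m2 = Mul(m, m);
    const Mat4 m3 = Mul(m2, m);
    const Mat4 m4 = Mul(m3, m);

    const double t1 = Trace(m), t2 = Trace(m2), t3 = Trace(m3), t4 = Trace(m4);

    const double c1 = -t1;
    const double c2 = -1.0 / 2.0 * (c1 * t1 + t2);
    const double c3 = -1.0 / 3.0 * (c2 * t1 + c1 * t2 + t3);
    const double c4 = -1.0 / 4.0 * (c3 * t1 + c2 * t2 + c1 * t3 + t4);

    if (std::fabs(c4) < epsilon)
        return false;                           // matrix treated as singular

    const double scale = -1.0 / c4;
    for (int i = 0; i < 4; i++)
    {
        for (int j = 0; j < 4; j++)
        {
            const double ident = (i == j) ? 1.0 : 0.0;
            m_inv[i][j] = scale * (m3[i][j] + c1 * m2[i][j] + c2 * m[i][j] + c3 * ident);
        }
    }
    return true;
}
```

A quick way to check a result is to multiply it by the original matrix and confirm that the product is the identity matrix to within a small tolerance, which is the same verification step that example Ch09_06 performs.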


Toward the top of the C++ code is a function named Avx2Mat4x4InvF64Cpp. This function calculates
the inverse of a 4 x 4 matrix of double-precision floating-point values using the aforementioned equations.
Function Avx2Mat4x4InvF64Cpp uses the C++ class Matrix<> to perform many of the required intermediate
computations, including matrix addition, multiplication, and trace. The source code for class Matrix<> is not
shown but is included with the chapter download package. Note that the intermediate matrices are declared
using the static qualifier in order to avoid constructor overhead when performing benchmark timing
measurements. The drawback of using the static qualifier here is that the function is not thread-safe
(a thread-safe function can be used simultaneously by multiple threads). Following calculation of the trace
values t1-t4, Avx2Mat4x4InvF64Cpp computes c1-c4 using simple scalar arithmetic. It then checks to
make sure the source matrix m is not singular by comparing the absolute value of c4 against epsilon. If
matrix m is not singular, the final inverse is calculated. The remaining C++ code performs test case
initialization and exercises both the C++ and assembly language matrix inversion functions.

The assembly language code in Listing 9-6 begins with a custom segment that contains definitions of 
the constant values needed by the assembly language matrix inversion functions. The statement ConstVals 
segment readonly align(32) 'const' marks the start of a segment that begins on a 32-byte boundary and 
contains read-only data. The reason for using a custom segment here is that the MASM align directive does 
not support aligning data items on a 32-byte boundary. In this example, proper alignment of the packed 
constants is essential in order to maximize performance. Note that the scalar double-precision floating-point 
constants are defined after the 256-bit wide packed constants and are aligned on an 8-byte boundary. The 
MASM statement ConstVals ends terminates the custom segment. 

Following the custom constant segment is the macro _Max4x4TraceF64. This macro contains
instructions that calculate the trace of a 4 x 4 matrix of double-precision floating-point values. Macro
_Max4x4TraceF64 requires the four rows of the source matrix to be loaded in registers YMM0-YMM3 and uses
the vblendpd, vperm2f128, and vhaddpd instructions to calculate the matrix trace, as shown in
Figure 9-6. The vblendpd (Blend Packed Double-Precision Floating-Point Values) instruction merges values
from its two source operands according to an immediate control mask. If bit 0 of the control mask equals 0,
element 0 (i.e., bits 63:0) from the first source operand is copied to the corresponding element position in
the destination operand; otherwise, element 0 from the second source operand is copied to the destination
operand. Bits 1-3 of the control mask are used in a similar manner for the other three elements. Register
XMM0[63:0] contains the trace value following execution of the vhaddpd instruction.


[Figure 9-6: starting from the four matrix rows in YMM0-YMM3, the figure shows the register contents
after each of the following instructions; elements not used by the next step are don't-care values.]

    vblendpd ymm0,ymm0,ymm1,00000010b
    vblendpd ymm1,ymm2,ymm3,00001000b
    vperm2f128 ymm2,ymm1,ymm1,00000001b
    vaddpd ymm3,ymm0,ymm2
    vhaddpd ymm0,ymm3,ymm3

Figure 9-6. Trace calculation for a 4 x 4 matrix
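
The blend-based trace calculation can be imitated with a scalar model, which makes it easier to see which elements matter at each step. This sketch is illustrative only; Vec4d, BlendPd, and Trace4x4 are hypothetical names, and the vperm2f128 step is modeled simply as a swap of the two 128-bit halves, which is what the control mask 00000001b does when both source operands are the same register.

```cpp
#include <array>

using Vec4d = std::array<double, 4>;   // models a YMM register; element 0 = bits 63:0

// vblendpd: bit i of imm8 selects element i from b (bit = 1) or from a (bit = 0).
Vec4d BlendPd(const Vec4d& a, const Vec4d& b, unsigned imm8)
{
    Vec4d r{};
    for (unsigned i = 0; i < 4; i++)
        r[i] = ((imm8 >> i) & 1) ? b[i] : a[i];
    return r;
}

// Computes the trace of a 4 x 4 matrix using the same steps as the macro.
double Trace4x4(const std::array<Vec4d, 4>& row)
{
    const Vec4d d01 = BlendPd(row[0], row[1], 0x2);   // elements 0,1 = m00, m11
    const Vec4d d23 = BlendPd(row[2], row[3], 0x8);   // elements 2,3 = m22, m33

    // vperm2f128 with imm8 = 01h: upper half of d23 moves to the lower half.
    const Vec4d sw = { d23[2], d23[3], d23[0], d23[1] };

    // vaddpd: only elements 0 and 1 carry useful values afterwards.
    const Vec4d sum = { d01[0] + sw[0], d01[1] + sw[1], d01[2] + sw[2], d01[3] + sw[3] };

    return sum[0] + sum[1];                           // vhaddpd, element 0
}
```

For a matrix whose rows are (10, 11, 12, 13), (20, 21, 22, 23), (30, 31, 32, 33), and (40, 41, 42, 43), the function returns 106.0, the sum of the diagonal values 10, 21, 32, and 43.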


The assembly language function Avx2Mat4x4InvF64_ calculates an inverse matrix using the same
technique as the corresponding C++ function. Following its prolog, the function Avx2Mat4x4InvF64_
saves its argument values to the home area for later use. It then allocates storage space on the stack to
hold intermediate results. More specifically, the and rsp,0ffffffe0h instruction aligns RSP to a 32-byte
boundary, and the sub rsp,416 instruction allocates local stack space that's required for the intermediate
matrices m2, m3, and m4 plus 32 bytes for function calls. Next, a series of calls are made to the functions
Avx2Mat4x4MulF64_ and Avx2Mat4x4TraceF64_ to calculate the trace values t1-t4. The matrix multiplication
code that's used in this example is basically the same code that you saw in example Ch09_05. The algorithm
coefficients c1-c4 are calculated next using simple scalar floating-point arithmetic. Coefficient c4 is then
tested to verify that the source matrix is not singular. If the source matrix is not singular, the function
calculates the inverse matrix m_inv. Note that all of the arithmetic required to calculate m_inv is carried out
using straightforward packed double-precision floating-point multiplication and addition. Here is the output
for source code example Ch09_06:


Results for Avx2Mat4x4InvF64

Test #1 - Test Matrix
           2           7           3           4
           5           9           6        4.75
         6.5           3           4          10
           7        5.25       8.125           6

Test #1 - Avx2Mat4x4InvF64Cpp - Inverse Matrix
   -0.943926     0.91657    0.197547   -0.425579
  -0.0568818    0.251148  0.00302831   -0.165952
    0.545399   -0.647656   -0.213597    0.505123
    0.412456   -0.412053   0.0561248    0.124363

Test #1 - Avx2Mat4x4InvF64Cpp - Verify Matrix
           1           0           0           0
           0           1           0           0
           0           0           1           0
           0           0           0           1

Test #1 - Avx2Mat4x4InvF64_ - Inverse Matrix
   -0.943926     0.91657    0.197547   -0.425579
  -0.0568818    0.251148  0.00302831   -0.165952
    0.545399   -0.647656   -0.213597    0.505123
    0.412456   -0.412053   0.0561248    0.124363

Test #1 - Avx2Mat4x4InvF64_ - Verify Matrix
           1           0           0           0
           0           1           0           0
           0           0           1           0
           0           0           0           1

Test #2 - Test Matrix
         0.5          12       17.25           4
           5           2        6.75           8
      13.125           1           3        9.75
          16       1.625           7        0.25

Test #2 - Avx2Mat4x4InvF64Cpp - Inverse Matrix
  0.00165165  -0.0690239   0.0549591   0.0389347
    0.135369   -0.359846    0.242038  -0.0903252
  -0.0350097    0.239298   -0.183964   0.0772214
  -0.0053352    0.056194   0.0603606  -0.0669085

Test #2 - Avx2Mat4x4InvF64Cpp - Verify Matrix
           1           0           0           0
           0           1           0           0
           0           0           1           0
           0           0           0           1

Test #2 - Avx2Mat4x4InvF64_ - Inverse Matrix
  0.00165165  -0.0690239   0.0549591   0.0389347
    0.135369   -0.359846    0.242038  -0.0903252
  -0.0350097    0.239298   -0.183964   0.0772214
  -0.0053352    0.056194   0.0603606  -0.0669085

Test #2 - Avx2Mat4x4InvF64_ - Verify Matrix
           1           0           0           0
           0           1           0           0
           0           0           1           0
           0           0           0           1

Test #3 - Test Matrix
           2           0           0           1
           0           4           5           0
           0           0           0           7
           0           0           0           6

Test #3 - Avx2Mat4x4InvF64Cpp - Singular Matrix
Test #3 - Avx2Mat4x4InvF64_ - Singular Matrix

Running benchmark function Avx2Mat4x4InvF64_BM - please wait
Benchmark times save to file Ch09_06_Avx2Mat4x4InvF64_BM_CHROMIUM.csv

Table 9-4 contains benchmark timing measurements for the matrix inversion functions. 


Table 9-4. Matrix Inverse Mean Execution Times (Microseconds), 100,000 Inversions

CPU          C++       Assembly Language
i7-4790S     30417     4168
i9-7900X     26646     3773
i7-8700K     24485     2941
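
The trace-based inversion described earlier in this section can be sketched in scalar C++. This is only an illustration under stated assumptions, not the book's source code: the names Mat4, Mul, Trace, and InvertByTraces and the default epsilon are hypothetical, and the coefficient formulas are the standard Newton-identity relations for the characteristic polynomial, which are assumed here to correspond to the coefficients c1-c4 mentioned above. The inverse then follows from the Cayley-Hamilton theorem.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

using Mat4 = std::array<std::array<double, 4>, 4>;

// Returns the matrix product a * b.
Mat4 Mul(const Mat4& a, const Mat4& b)
{
    Mat4 c{};                                   // value-initialized to all zeros
    for (std::size_t i = 0; i < 4; i++)
        for (std::size_t j = 0; j < 4; j++)
            for (std::size_t k = 0; k < 4; k++)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Returns the sum of the main diagonal elements.
double Trace(const Mat4& m)
{
    return m[0][0] + m[1][1] + m[2][2] + m[3][3];
}

// Hypothetical scalar model of the trace-based inversion.
// Returns false when the matrix is treated as singular (|c4| < epsilon).
bool InvertByTraces(const Mat4& m, Mat4& m_inv, double epsilon = 1.0e-9)
{
    const Mat4 m2 = Mul(m, m);
    const Mat4 m3 = Mul(m2, m);
    const Mat4 m4 = Mul(m3, m);

    const double t1 = Trace(m);
    const double t2 = Trace(m2);
    const double t3 = Trace(m3);
    const double t4 = Trace(m4);

    // Characteristic polynomial coefficients from Newton's identities.
    const double c1 = -t1;
    const double c2 = -(c1 * t1 + t2) / 2.0;
    const double c3 = -(c2 * t1 + c1 * t2 + t3) / 3.0;
    const double c4 = -(c3 * t1 + c2 * t2 + c1 * t3 + t4) / 4.0;

    if (std::fabs(c4) < epsilon)
        return false;

    // Cayley-Hamilton: m_inv = -(m3 + c1*m2 + c2*m + c3*I) / c4
    for (std::size_t i = 0; i < 4; i++)
    {
        for (std::size_t j = 0; j < 4; j++)
        {
            const double identity = (i == j) ? 1.0 : 0.0;
            m_inv[i][j] = -(m3[i][j] + c1 * m2[i][j] + c2 * m[i][j] + c3 * identity) / c4;
        }
    }
    return true;
}
```

Multiplying a source matrix by the result of InvertByTraces should reproduce the identity matrix to within rounding error, which mirrors the "Verify Matrix" output shown above. Coefficient c4 equals the determinant of the source matrix, which is why it doubles as the singularity test.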


Blend and Permute Instructions 


A data blend operation conditionally copies elements from two packed source operands to a packed 
destination operand using a control mask that specifies which elements to copy. A data permute operation 
rearranges the elements of a packed source operand according to a control mask. You've already seen 
several source code examples in this chapter that exploited data blend and permute operations. The next 
source code example is named Ch09_07 and includes code that demonstrates how to use additional blend 
and permute instructions. Listing 9-7 shows the source code for example Ch09_07.

Listing 9-7. Example Ch09_07

#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <iomanip>
#include "YmmVal.h"

using namespace std;

extern "C" void AvxBlendF32_(YmmVal* des1, YmmVal* src1, YmmVal* src2, YmmVal* idx1);
extern "C" void Avx2PermuteF32_(YmmVal* des1, YmmVal* src1, YmmVal* idx1, YmmVal* des2,
    YmmVal* src2, YmmVal* idx2);

void AvxBlendF32(void)
{
    const uint32_t sel0 = 0x00000000;
    const uint32_t sel1 = 0x80000000;

    alignas(32) YmmVal des1, src1, src2, idx1;

    src1.m_F32[0] = 10.0f;  src2.m_F32[0] = 100.0f;  idx1.m_I32[0] = sel1;
    src1.m_F32[1] = 20.0f;  src2.m_F32[1] = 200.0f;  idx1.m_I32[1] = sel0;
    src1.m_F32[2] = 30.0f;  src2.m_F32[2] = 300.0f;  idx1.m_I32[2] = sel0;
    src1.m_F32[3] = 40.0f;  src2.m_F32[3] = 400.0f;  idx1.m_I32[3] = sel1;
    src1.m_F32[4] = 50.0f;  src2.m_F32[4] = 500.0f;  idx1.m_I32[4] = sel1;
    src1.m_F32[5] = 60.0f;  src2.m_F32[5] = 600.0f;  idx1.m_I32[5] = sel0;
    src1.m_F32[6] = 70.0f;  src2.m_F32[6] = 700.0f;  idx1.m_I32[6] = sel1;
    src1.m_F32[7] = 80.0f;  src2.m_F32[7] = 800.0f;  idx1.m_I32[7] = sel0;

    AvxBlendF32_(&des1, &src1, &src2, &idx1);

    cout << "\nResults for AvxBlendF32 (vblendvps)\n";
    cout << fixed << setprecision(1);

    for (size_t i = 0; i < 8; i++)
    {
        cout << "i: " << setw(2) << i << "  ";
        cout << "src1: " << setw(8) << src1.m_F32[i] << "  ";
        cout << "src2: " << setw(8) << src2.m_F32[i] << "  ";
        cout << setfill('0');
        cout << "idx1: 0x" << setw(8) << hex << idx1.m_U32[i] << "  ";
        cout << setfill(' ');
        cout << "des1: " << setw(8) << des1.m_F32[i] << '\n';
    }
}

void Avx2PermuteF32(void)
{
    alignas(32) YmmVal des1, src1, idx1;
    alignas(32) YmmVal des2, src2, idx2;

    // idx1 values must be between 0 and 7.
    src1.m_F32[0] = 100.0f;  idx1.m_I32[0] = 3;
    src1.m_F32[1] = 200.0f;  idx1.m_I32[1] = 7;
    src1.m_F32[2] = 300.0f;  idx1.m_I32[2] = 0;
    src1.m_F32[3] = 400.0f;  idx1.m_I32[3] = 4;
    src1.m_F32[4] = 500.0f;  idx1.m_I32[4] = 6;
    src1.m_F32[5] = 600.0f;  idx1.m_I32[5] = 6;
    src1.m_F32[6] = 700.0f;  idx1.m_I32[6] = 1;
    src1.m_F32[7] = 800.0f;  idx1.m_I32[7] = 2;

    // idx2 values must be between 0 and 3.
    src2.m_F32[0] = 100.0f;  idx2.m_I32[0] = 3;
    src2.m_F32[1] = 200.0f;  idx2.m_I32[1] = 1;
    src2.m_F32[2] = 300.0f;  idx2.m_I32[2] = 1;
    src2.m_F32[3] = 400.0f;  idx2.m_I32[3] = 2;
    src2.m_F32[4] = 500.0f;  idx2.m_I32[4] = 3;
    src2.m_F32[5] = 600.0f;  idx2.m_I32[5] = 2;
    src2.m_F32[6] = 700.0f;  idx2.m_I32[6] = 0;
    src2.m_F32[7] = 800.0f;  idx2.m_I32[7] = 0;

    Avx2PermuteF32_(&des1, &src1, &idx1, &des2, &src2, &idx2);

    cout << "\nResults for Avx2PermuteF32 (vpermps)\n";
    cout << fixed << setprecision(1);

    for (size_t i = 0; i < 8; i++)
    {
        cout << "i: " << setw(2) << i << "  ";
        cout << "src1: " << setw(8) << src1.m_F32[i] << "  ";
        cout << "idx1: " << setw(8) << idx1.m_I32[i] << "  ";
        cout << "des1: " << setw(8) << des1.m_F32[i] << '\n';
    }

    cout << "\nResults for Avx2PermuteF32 (vpermilps)\n";

    for (size_t i = 0; i < 8; i++)
    {
        cout << "i: " << setw(2) << i << "  ";
        cout << "src2: " << setw(8) << src2.m_F32[i] << "  ";
        cout << "idx2: " << setw(8) << idx2.m_I32[i] << "  ";
        cout << "des2: " << setw(8) << des2.m_F32[i] << '\n';
    }
}

int main()
{
    AvxBlendF32();
    Avx2PermuteF32();
    return 0;
}
;-------------------------------------------------
;               Ch09_07.asm
;-------------------------------------------------

; extern "C" void AvxBlendF32_(YmmVal* des1, YmmVal* src1, YmmVal* src2, YmmVal* idx1)

        .code
AvxBlendF32_ proc
        vmovaps ymm0,ymmword ptr [rdx]      ;ymm0 = src1
        vmovaps ymm1,ymmword ptr [r8]       ;ymm1 = src2
        vmovdqa ymm2,ymmword ptr [r9]       ;ymm2 = idx1

        vblendvps ymm3,ymm0,ymm1,ymm2       ;blend ymm0 & ymm1, ymm2 "indices"
        vmovaps ymmword ptr [rcx],ymm3      ;save result to des1

        vzeroupper
        ret
AvxBlendF32_ endp

; extern "C" void Avx2PermuteF32_(YmmVal* des1, YmmVal* src1, YmmVal* idx1, YmmVal* des2,
;   YmmVal* src2, YmmVal* idx2)

Avx2PermuteF32_ proc

; Perform vpermps permutation
        vmovaps ymm0,ymmword ptr [rdx]      ;ymm0 = src1
        vmovdqa ymm1,ymmword ptr [r8]       ;ymm1 = idx1
        vpermps ymm2,ymm1,ymm0              ;permute ymm0 using ymm1 indices
        vmovaps ymmword ptr [rcx],ymm2      ;save result to des1

; Perform vpermilps permutation
        mov rdx,[rsp+40]                    ;rdx = src2 ptr
        mov r8,[rsp+48]                     ;r8 = idx2 ptr
        vmovaps ymm3,ymmword ptr [rdx]      ;ymm3 = src2
        vmovdqa ymm4,ymmword ptr [r8]       ;ymm4 = idx2
        vpermilps ymm5,ymm3,ymm4            ;permute ymm3 using ymm4 indices
        vmovaps ymmword ptr [r9],ymm5       ;save result to des2

        vzeroupper
        ret
Avx2PermuteF32_ endp
        end

The C++ code in Listing 9-7 begins with a function named AvxBlendF32 that initializes YmmVal variables
src1 and src2 using single-precision floating-point values. It also initializes a third YmmVal variable named
idx1 for use as a blend control mask. The high-order bit of each doubleword element in idx1 specifies
whether the corresponding element from src1 (high-order bit = 0) or src2 (high-order bit = 1) is copied to
the destination operand. These three source operands are used by the vblendvps (Variable Blend Packed
Single-Precision Floating-Point Values) instruction, which is located in the assembly language function
AvxBlendF32_. Following execution of this function, the results are streamed to cout.

The C++ code in Listing 9-7 also includes a function named Avx2PermuteF32. This function
initializes several YmmVal variables that demonstrate use of the vpermps and vpermilps instructions. Both
of these instructions require a set of indices that specify which source operand elements are copied to
the destination operand. For example, the statement idx1.m_I32[0] = 3 is used to direct the vpermps
instruction in Avx2PermuteF32_ to perform des1.m_F32[0] = src1.m_F32[3]. The vpermps instruction
requires each index in idx1 to be between zero and seven. An index can be used more than once in idx1
in order to copy an element from src1 to multiple locations in des1. The vpermilps instruction requires its
indices to be between zero and three.

The assembly language function AvxBlendF32_ begins by loading the source data operands into
registers YMM0 and YMM1 using two vmovaps instructions. The vmovdqa instruction that follows loads the
blend control mask into register YMM2. The ensuing vblendvps ymm3,ymm0,ymm1,ymm2 instruction blends
elements from registers YMM0 and YMM1 into YMM3 according to the control values in YMM2. The
high-order bit of each doubleword element in YMM2 specifies whether the corresponding element from YMM0
(high-order bit = 0) or YMM1 (high-order bit = 1) is copied to YMM3. Figure 9-7 illustrates the execution of
this instruction in greater detail. The vblendvps instruction and its double-precision counterpart vblendvpd
are examples of AVX instructions that require three source operands. Floating-point blend operations using
an immediate control mask are also possible with the vblendp[d|s] instructions.
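
The element selection rule of vblendvps can be modeled in a few lines of scalar C++. The sketch below is only an illustration of the rule described above; the function name BlendV and the use of std::array in place of YmmVal are assumptions made for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of vblendvps on eight single-precision lanes.
// For each lane, the high-order bit of the mask selects src2 (bit = 1)
// or src1 (bit = 0).
std::array<float, 8> BlendV(const std::array<float, 8>& src1,
                            const std::array<float, 8>& src2,
                            const std::array<std::uint32_t, 8>& mask)
{
    std::array<float, 8> des{};
    for (std::size_t i = 0; i < 8; i++)
        des[i] = (mask[i] & 0x80000000u) ? src2[i] : src1[i];
    return des;
}
```

Calling BlendV with the src1, src2, and idx1 values from Listing 9-7 selects the same elements as the assembly language function. Only bit 31 of each mask element matters; the remaining bits are ignored.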


Figure 9-7. Execution of the vblendvps instruction


Following AvxBlendF32_ in Listing 9-7 is the function Avx2PermuteF32_, which demonstrates use of
the vpermps and vpermilps instructions. The vpermps instruction permutes (or rearranges) the elements
of its second source operand (which is 256 bits wide and contains eight single-precision floating-point values)
according to the indices in its first source operand. In the instruction vpermps ymm2,ymm1,ymm0, register
YMM1 contains the indices and YMM0 contains the values to permute. The vpermilps (In-Lane Permute of
Single-Precision Floating-Point Values) instruction performs its permutations using two independent 128-bit
wide lanes (i.e., bits [255:128] and bits [127:0]). Its first source operand contains the values to permute and
its second source operand contains the control indices. The control indices for an in-lane permutation must
range between zero and three, and each lane uses its own distinct set of indices. Figure 9-8 illustrates the
execution of these instructions in greater detail. AVX and AVX2 also include the double-precision
floating-point permute instructions vpermilpd and vpermpd.
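
The difference between a full-width permutation and an in-lane permutation can be sketched with a scalar C++ model. This is an illustration only; the names F32x8, I32x8, PermuteFull, and PermuteInLane are hypothetical and do not appear in the example's source code.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using F32x8 = std::array<float, 8>;
using I32x8 = std::array<std::int32_t, 8>;

// Scalar model of vpermps: each destination element can come from any of
// the eight source elements.
F32x8 PermuteFull(const F32x8& src, const I32x8& idx)
{
    F32x8 des{};
    for (std::size_t i = 0; i < 8; i++)
        des[i] = src[static_cast<std::size_t>(idx[i] & 7)];    // low three index bits
    return des;
}

// Scalar model of vpermilps: each destination element comes from the same
// 128-bit lane (elements 0-3 or elements 4-7) of the source.
F32x8 PermuteInLane(const F32x8& src, const I32x8& idx)
{
    F32x8 des{};
    for (std::size_t i = 0; i < 8; i++)
    {
        const std::size_t lane_base = (i / 4) * 4;              // 0 = low lane, 4 = high lane
        des[i] = src[lane_base + static_cast<std::size_t>(idx[i] & 3)];  // low two index bits
    }
    return des;
}
```

With the idx2 values from Listing 9-7, PermuteInLane never moves an element across the lane boundary: index 3 selects element 3 in the low lane but element 7 in the high lane.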




CHAPTER 9 bn AVX2 PROGRAMMING — PACKED FLOATING-POINT 


Figure 9-8. Execution of the vpermps and vpermilps instructions

Here is the output for source code example Ch09_07:


Results for AvxBlendF32 (vblendvps)
i:  0  src1:     10.0  src2:    100.0  idx1: 0x80000000  des1:    100.0
i:  1  src1:     20.0  src2:    200.0  idx1: 0x00000000  des1:     20.0
i:  2  src1:     30.0  src2:    300.0  idx1: 0x00000000  des1:     30.0
i:  3  src1:     40.0  src2:    400.0  idx1: 0x80000000  des1:    400.0
i:  4  src1:     50.0  src2:    500.0  idx1: 0x80000000  des1:    500.0
i:  5  src1:     60.0  src2:    600.0  idx1: 0x00000000  des1:     60.0
i:  6  src1:     70.0  src2:    700.0  idx1: 0x80000000  des1:    700.0
i:  7  src1:     80.0  src2:    800.0  idx1: 0x00000000  des1:     80.0

Results for Avx2PermuteF32 (vpermps)
i:  0  src1:    100.0  idx1:        3  des1:    400.0
i:  1  src1:    200.0  idx1:        7  des1:    800.0
i:  2  src1:    300.0  idx1:        0  des1:    100.0
i:  3  src1:    400.0  idx1:        4  des1:    500.0
i:  4  src1:    500.0  idx1:        6  des1:    700.0
i:  5  src1:    600.0  idx1:        6  des1:    700.0
i:  6  src1:    700.0  idx1:        1  des1:    200.0
i:  7  src1:    800.0  idx1:        2  des1:    300.0

Results for Avx2PermuteF32 (vpermilps)
i:  0  src2:    100.0  idx2:        3  des2:    400.0
i:  1  src2:    200.0  idx2:        1  des2:    200.0
i:  2  src2:    300.0  idx2:        1  des2:    200.0
i:  3  src2:    400.0  idx2:        2  des2:    300.0
i:  4  src2:    500.0  idx2:        3  des2:    800.0
i:  5  src2:    600.0  idx2:        2  des2:    700.0
i:  6  src2:    700.0  idx2:        0  des2:    500.0
i:  7  src2:    800.0  idx2:        0  des2:    500.0


Data Gather Instructions


The final source code example of this chapter, Ch09_08, explains how to use the AVX2 gather instructions.
A gather instruction conditionally loads elements from non-contiguous memory locations (typically an
array) into an XMM or YMM register. A gather instruction requires a set of indices and a merge control mask
that specifies which elements to copy. Listing 9-8 shows the source code for example Ch09_08. Chapter 8
presented an overview of the AVX2 gather instructions, including a graphic (see Figure 8-1) that elucidated
execution of the vgatherdps instruction. You may find it helpful to review that material prior to perusing the
source code and discussions in this section.


Listing 9-8. Example Ch09_08

#include "stdafx.h"
#include <string>
#include <cstdint>
#include <iostream>
#include <iomanip>
#include <array>
#include <stdexcept>

using namespace std;

extern "C" void Avx2Gather8xF32_I32_(float* y, const float* x,
    const int32_t* indices, const int32_t* masks);
extern "C" void Avx2Gather8xF32_I64_(float* y, const float* x,
    const int64_t* indices, const int32_t* masks);
extern "C" void Avx2Gather8xF64_I32_(double* y, const double* x,
    const int32_t* indices, const int64_t* masks);
extern "C" void Avx2Gather8xF64_I64_(double* y, const double* x,
    const int64_t* indices, const int64_t* masks);

template <typename T, typename I, typename M, size_t N>
void Print(const string& msg, const array<T, N>& y, const array<I, N>& indices,
    const array<M, N>& merge)
{
    if (y.size() != indices.size() || y.size() != merge.size())
        throw runtime_error("Non-conforming arrays - Print");

    cout << '\n' << msg << '\n';

    for (size_t i = 0; i < y.size(); i++)
    {
        string merge_s = (merge[i] == 1) ? "Yes" : "No";

        cout << "i: " << setw(2) << i << "  ";
        cout << "y: " << setw(10) << y[i] << "  ";
        cout << "index: " << setw(4) << indices[i] << "  ";
        cout << "merge: " << setw(4) << merge_s << '\n';
    }
}

void Avx2Gather8xF32_I32()
{
    array<float, 20> x;

    for (size_t i = 0; i < x.size(); i++)
        x[i] = (float)(i * 10);

    array<float, 8> y { -1, -1, -1, -1, -1, -1, -1, -1 };
    array<int32_t, 8> indices { 2, 1, 6, 5, 4, 13, 11, 9 };
    array<int32_t, 8> merge { 1, 1, 0, 1, 1, 0, 1, 1 };

    cout << fixed << setprecision(1);
    cout << "\nResults for Avx2Gather8xF32_I32\n";

    Print("Values before", y, indices, merge);
    Avx2Gather8xF32_I32_(y.data(), x.data(), indices.data(), merge.data());
    Print("Values after", y, indices, merge);
}

void Avx2Gather8xF32_I64()
{
    array<float, 20> x;

    for (size_t i = 0; i < x.size(); i++)
        x[i] = (float)(i * 10);

    array<float, 8> y { -1, -1, -1, -1, -1, -1, -1, -1 };
    array<int64_t, 8> indices { 19, 1, 0, 5, 4, 3, 11, 11 };
    array<int32_t, 8> merge { 1, 1, 1, 1, 0, 0, 1, 1 };

    cout << fixed << setprecision(1);
    cout << "\nResults for Avx2Gather8xF32_I64\n";

    Print("Values before", y, indices, merge);
    Avx2Gather8xF32_I64_(y.data(), x.data(), indices.data(), merge.data());
    Print("Values after", y, indices, merge);
}

void Avx2Gather8xF64_I32()
{
    array<double, 20> x;

    for (size_t i = 0; i < x.size(); i++)
        x[i] = (double)(i * 10);

    array<double, 8> y { -1, -1, -1, -1, -1, -1, -1, -1 };
    array<int32_t, 8> indices { 12, 11, 6, 15, 4, 13, 18, 3 };
    array<int64_t, 8> merge { 1, 1, 0, 1, 1, 0, 1, 0 };

    cout << fixed << setprecision(1);
    cout << "\nResults for Avx2Gather8xF64_I32\n";

    Print("Values before", y, indices, merge);
    Avx2Gather8xF64_I32_(y.data(), x.data(), indices.data(), merge.data());
    Print("Values after", y, indices, merge);
}

void Avx2Gather8xF64_I64()
{
    array<double, 20> x;

    for (size_t i = 0; i < x.size(); i++)
        x[i] = (double)(i * 10);

    array<double, 8> y { -1, -1, -1, -1, -1, -1, -1, -1 };
    array<int64_t, 8> indices { 11, 17, 1, 6, 14, 13, 8, 8 };
    array<int64_t, 8> merge { 1, 0, 1, 1, 1, 0, 1, 1 };

    cout << fixed << setprecision(1);
    cout << "\nResults for Avx2Gather8xF64_I64\n";

    Print("Values before", y, indices, merge);
    Avx2Gather8xF64_I64_(y.data(), x.data(), indices.data(), merge.data());
    Print("Values after", y, indices, merge);
}

int main()
{
    Avx2Gather8xF32_I32();
    Avx2Gather8xF32_I64();
    Avx2Gather8xF64_I32();
    Avx2Gather8xF64_I64();
    return 0;
}

; For each of the following functions, the contents of y are loaded
; into ymm0 prior to execution of the vgatherXXX instruction in order to
; demonstrate the effects of conditional merging.

        .code

; extern "C" void Avx2Gather8xF32_I32_(float* y, const float* x, const int32_t* indices,
;   const int32_t* merge)

Avx2Gather8xF32_I32_ proc
        vmovups ymm0,ymmword ptr [rcx]      ;ymm0 = y[7]:y[0]
        vmovdqu ymm1,ymmword ptr [r8]       ;ymm1 = indices[7]:indices[0]
        vmovdqu ymm2,ymmword ptr [r9]       ;ymm2 = merge[7]:merge[0]
        vpslld ymm2,ymm2,31                 ;shift merge vals to high-order bits
        vgatherdps ymm0,[rdx+ymm1*4],ymm2   ;ymm0 = gathered elements
        vmovups ymmword ptr [rcx],ymm0      ;save gathered elements

        vzeroupper
        ret
Avx2Gather8xF32_I32_ endp

; extern "C" void Avx2Gather8xF32_I64_(float* y, const float* x, const int64_t* indices,
;   const int32_t* merge)

Avx2Gather8xF32_I64_ proc
        vmovups xmm0,xmmword ptr [rcx]      ;xmm0 = y[3]:y[0]
        vmovdqu ymm1,ymmword ptr [r8]       ;ymm1 = indices[3]:indices[0]
        vmovdqu xmm2,xmmword ptr [r9]       ;xmm2 = merge[3]:merge[0]
        vpslld xmm2,xmm2,31                 ;shift merge vals to high-order bits
        vgatherqps xmm0,[rdx+ymm1*4],xmm2   ;xmm0 = gathered elements
        vmovups xmmword ptr [rcx],xmm0      ;save gathered elements

        vmovups xmm3,xmmword ptr [rcx+16]   ;xmm3 = y[7]:y[4]
        vmovdqu ymm1,ymmword ptr [r8+32]    ;ymm1 = indices[7]:indices[4]
        vmovdqu xmm2,xmmword ptr [r9+16]    ;xmm2 = merge[7]:merge[4]
        vpslld xmm2,xmm2,31                 ;shift merge vals to high-order bits
        vgatherqps xmm3,[rdx+ymm1*4],xmm2   ;xmm3 = gathered elements
        vmovups xmmword ptr [rcx+16],xmm3   ;save gathered elements

        vzeroupper
        ret
Avx2Gather8xF32_I64_ endp

; extern "C" void Avx2Gather8xF64_I32_(double* y, const double* x, const int32_t* indices,
;   const int64_t* merge)

Avx2Gather8xF64_I32_ proc
        vmovupd ymm0,ymmword ptr [rcx]      ;ymm0 = y[3]:y[0]
        vmovdqu xmm1,xmmword ptr [r8]       ;xmm1 = indices[3]:indices[0]
        vmovdqu ymm2,ymmword ptr [r9]       ;ymm2 = merge[3]:merge[0]
        vpsllq ymm2,ymm2,63                 ;shift merge vals to high-order bits
        vgatherdpd ymm0,[rdx+xmm1*8],ymm2   ;ymm0 = gathered elements
        vmovupd ymmword ptr [rcx],ymm0      ;save gathered elements

        vmovupd ymm0,ymmword ptr [rcx+32]   ;ymm0 = y[7]:y[4]
        vmovdqu xmm1,xmmword ptr [r8+16]    ;xmm1 = indices[7]:indices[4]
        vmovdqu ymm2,ymmword ptr [r9+32]    ;ymm2 = merge[7]:merge[4]
        vpsllq ymm2,ymm2,63                 ;shift merge vals to high-order bits
        vgatherdpd ymm0,[rdx+xmm1*8],ymm2   ;ymm0 = gathered elements
        vmovupd ymmword ptr [rcx+32],ymm0   ;save gathered elements

        vzeroupper
        ret
Avx2Gather8xF64_I32_ endp

; extern "C" void Avx2Gather8xF64_I64_(double* y, const double* x, const int64_t* indices,
;   const int64_t* merge)

Avx2Gather8xF64_I64_ proc
        vmovupd ymm0,ymmword ptr [rcx]      ;ymm0 = y[3]:y[0]
        vmovdqu ymm1,ymmword ptr [r8]       ;ymm1 = indices[3]:indices[0]
        vmovdqu ymm2,ymmword ptr [r9]       ;ymm2 = merge[3]:merge[0]
        vpsllq ymm2,ymm2,63                 ;shift merge vals to high-order bits
        vgatherqpd ymm0,[rdx+ymm1*8],ymm2   ;ymm0 = gathered elements
        vmovupd ymmword ptr [rcx],ymm0      ;save gathered elements

        vmovupd ymm0,ymmword ptr [rcx+32]   ;ymm0 = y[7]:y[4]
        vmovdqu ymm1,ymmword ptr [r8+32]    ;ymm1 = indices[7]:indices[4]
        vmovdqu ymm2,ymmword ptr [r9+32]    ;ymm2 = merge[7]:merge[4]
        vpsllq ymm2,ymm2,63                 ;shift merge vals to high-order bits
        vgatherqpd ymm0,[rdx+ymm1*8],ymm2   ;ymm0 = gathered elements
        vmovupd ymmword ptr [rcx+32],ymm0   ;save gathered elements

        vzeroupper
        ret
Avx2Gather8xF64_I64_ endp
        end


The C++ source code in example Ch09_08 includes four functions that initialize test cases to perform
single-precision and double-precision floating-point gather operations using signed doubleword or
quadword indices. The function Avx2Gather8xF32_I32 begins by initializing the elements of array x (the
source array) with test values. Note that this function uses the STL class array<> instead of a raw C++ array
to demonstrate use of the former with an assembly language function. Appendix A contains a list of C++
references that you can consult if you're interested in learning more about this class. Next, each element in
array y (the destination array) is set to -1.0 in order to illustrate the effects of conditional merging. The arrays
indices and merge are also primed with the required gather instruction indices and merge control mask
values, respectively. The assembly language function Avx2Gather8xF32_I32_ is then called to carry out the
gather operation. Note that raw pointers for the various STL arrays are obtained using the member function
array<>::data. The other C++ functions in this source example (Avx2Gather8xF32_I64, Avx2Gather8xF64_I32,
and Avx2Gather8xF64_I64) are similarly structured.

The assembly language function Avx2Gather8xF32_I32_ begins by loading registers YMM0, YMM1,
and YMM2 with the test arrays y, indices, and merge, respectively. Register RDX contains a pointer to the
source array x. The vpslld ymm2,ymm2,31 instruction shifts the merge control mask values (each value
in this mask is zero or one) to the high-order bit of each doubleword element. The ensuing vgatherdps
ymm0,[rdx+ymm1*4],ymm2 instruction loads eight single-precision floating-point values from array x into
register YMM0. The merge control mask in YMM2 dictates which array elements are actually copied into the
destination operand YMM0. If the high-order bit of a merge control mask doubleword element is set to 1, the
corresponding element in YMM0 is updated; otherwise, it is not changed. Subsequent to the successful load
of an array element, the vgatherdps instruction sets the corresponding doubleword element in the merge
control mask to zero. The vmovups ymmword ptr [rcx],ymm0 instruction then saves the gather result to y.
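
The conditional merging performed by vgatherdps can be sketched in scalar C++ as follows. This is only a model of the behavior described above, not the example's code; the names MakeMask and GatherF32 are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Converts 0/1 merge flags into mask elements whose high-order bit carries
// the flag (the role of the vpslld ymm2,ymm2,31 instruction).
std::array<std::uint32_t, 8> MakeMask(const std::array<std::int32_t, 8>& merge)
{
    std::array<std::uint32_t, 8> mask{};
    for (std::size_t i = 0; i < 8; i++)
        mask[i] = static_cast<std::uint32_t>(merge[i]) << 31;
    return mask;
}

// Hypothetical scalar model of vgatherdps with eight doubleword indices.
// An element of y is overwritten only if the high-order bit of its mask
// element is set. Each mask element is cleared once it has been processed.
void GatherF32(std::array<float, 8>& y, const float* x,
               const std::array<std::int32_t, 8>& indices,
               std::array<std::uint32_t, 8>& mask)
{
    for (std::size_t i = 0; i < 8; i++)
    {
        if (mask[i] & 0x80000000u)
            y[i] = x[indices[i]];       // load x[index] into the destination element
        mask[i] = 0;                    // the mask is all zeros after completion
    }
}
```

Because the mask is consumed by the gather operation, code that needs the same mask for a second gather must reload or recompute it, which is what the assembly language functions in Listing 9-8 do before each gather instruction.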

The assembly language functions Avx2Gather8xF32_I64_, Avx2Gather8xF64_I32_, and
Avx2Gather8xF64_I64_ are analogous to Avx2Gather8xF32_I32_. Note that the gather instructions used in
these functions (vgatherqps, vgatherdpd, and vgatherqpd) gather only four elements, which explains why
they're used twice. Here are the results for source code example Ch09_08:


Results for Avx2Gather8xF32_I32

Values before
i:  0  y:       -1.0  index:    2  merge:  Yes
i:  1  y:       -1.0  index:    1  merge:  Yes
i:  2  y:       -1.0  index:    6  merge:   No
i:  3  y:       -1.0  index:    5  merge:  Yes
i:  4  y:       -1.0  index:    4  merge:  Yes
i:  5  y:       -1.0  index:   13  merge:   No
i:  6  y:       -1.0  index:   11  merge:  Yes
i:  7  y:       -1.0  index:    9  merge:  Yes

Values after
i:  0  y:       20.0  index:    2  merge:  Yes
i:  1  y:       10.0  index:    1  merge:  Yes
i:  2  y:       -1.0  index:    6  merge:   No
i:  3  y:       50.0  index:    5  merge:  Yes
i:  4  y:       40.0  index:    4  merge:  Yes
i:  5  y:       -1.0  index:   13  merge:   No
i:  6  y:      110.0  index:   11  merge:  Yes
i:  7  y:       90.0  index:    9  merge:  Yes

Results for Avx2Gather8xF32_I64

Values before
i:  0  y:       -1.0  index:   19  merge:  Yes
i:  1  y:       -1.0  index:    1  merge:  Yes
i:  2  y:       -1.0  index:    0  merge:  Yes
i:  3  y:       -1.0  index:    5  merge:  Yes
i:  4  y:       -1.0  index:    4  merge:   No
i:  5  y:       -1.0  index:    3  merge:   No
i:  6  y:       -1.0  index:   11  merge:  Yes
i:  7  y:       -1.0  index:   11  merge:  Yes

Values after
i:  0  y:      190.0  index:   19  merge:  Yes
i:  1  y:       10.0  index:    1  merge:  Yes
i:  2  y:        0.0  index:    0  merge:  Yes
i:  3  y:       50.0  index:    5  merge:  Yes
i:  4  y:       -1.0  index:    4  merge:   No
i:  5  y:       -1.0  index:    3  merge:   No
i:  6  y:      110.0  index:   11  merge:  Yes
i:  7  y:      110.0  index:   11  merge:  Yes

Results for Avx2Gather8xF64_I32

Values before
i:  0  y:       -1.0  index:   12  merge:  Yes
i:  1  y:       -1.0  index:   11  merge:  Yes
i:  2  y:       -1.0  index:    6  merge:   No
i:  3  y:       -1.0  index:   15  merge:  Yes
i:  4  y:       -1.0  index:    4  merge:  Yes
i:  5  y:       -1.0  index:   13  merge:   No
i:  6  y:       -1.0  index:   18  merge:  Yes
i:  7  y:       -1.0  index:    3  merge:   No

Values after
i:  0  y:      120.0  index:   12  merge:  Yes
i:  1  y:      110.0  index:   11  merge:  Yes
i:  2  y:       -1.0  index:    6  merge:   No
i:  3  y:      150.0  index:   15  merge:  Yes
i:  4  y:       40.0  index:    4  merge:  Yes
i:  5  y:       -1.0  index:   13  merge:   No
i:  6  y:      180.0  index:   18  merge:  Yes
i:  7  y:       -1.0  index:    3  merge:   No

Results for Avx2Gather8xF64_I64

Values before
i:  0  y:       -1.0  index:   11  merge:  Yes
i:  1  y:       -1.0  index:   17  merge:   No
i:  2  y:       -1.0  index:    1  merge:  Yes
i:  3  y:       -1.0  index:    6  merge:  Yes
i:  4  y:       -1.0  index:   14  merge:  Yes
i:  5  y:       -1.0  index:   13  merge:   No
i:  6  y:       -1.0  index:    8  merge:  Yes
i:  7  y:       -1.0  index:    8  merge:  Yes

Values after
i:  0  y:      110.0  index:   11  merge:  Yes
i:  1  y:       -1.0  index:   17  merge:   No
i:  2  y:       10.0  index:    1  merge:  Yes
i:  3  y:       60.0  index:    6  merge:  Yes
i:  4  y:      140.0  index:   14  merge:  Yes
i:  5  y:       -1.0  index:   13  merge:   No
i:  6  y:       80.0  index:    8  merge:  Yes
i:  7  y:       80.0  index:    8  merge:  Yes


Summary 


Here are the key learning points of Chapter 9:

Nearly all AVX packed single-precision and double-precision floating-point
instructions can be used with either 128-bit or 256-bit wide operands. Packed
floating-point operands should be properly aligned whenever possible, as
described in this chapter.


The MASM align directive cannot be used to align a 256-bit wide operand on a
32-byte boundary. Assembly language code can align 256-bit wide constant or mutable
operands on a 32-byte boundary using the MASM segment directive.


When performing packed arithmetic operations, the vcmpp[d|s] instructions can be
used with the vandp[d|s], vandnp[d|s], and vorp[d|s] instructions to make logical
decisions without any conditional jump instructions.


The non-associativity of floating-point arithmetic means that minute numerical 
discrepancies may occur when comparing values calculated using C++ and assembly 
language functions. 


Assembly language functions can use the vperm2f128, vpermp[d|s], and 
vpermilp[d|s] instructions to rearrange the elements of a packed floating-point 
operand. 


Assembly language functions can use the vblendp[d|s] and vblendvp[d|s] 
instructions to interleave the elements of two packed floating-point operands. 


Assembly language functions can use the vgatherdp[d|s] and vgatherqp[d|s] 
instructions to conditionally load floating-point values from non-contiguous 
memory locations into an XMM or YMM register. 


Assembly language functions that perform calculations using a YMM register should
also use a vzeroupper instruction prior to any epilog code or the ret instruction in
order to avoid potential x86-AVX to x86-SSE state transition performance delays.
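
The point about non-associativity in the list above can be demonstrated with a short C++ sketch. It compares a left-to-right scalar summation with a summation that accumulates four partial sums, which is the order a packed (SIMD) implementation would typically use. The function names and test data are hypothetical and were chosen to make the rounding difference easy to see.

```cpp
#include <cstddef>

// Sums the elements strictly from left to right, as a simple scalar loop would.
float SumSequential(const float* x, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

// Sums the elements using four independent partial sums, which mimics the
// accumulation order of a four-lane packed addition. This sketch assumes
// that n is a multiple of 4.
float SumFourLanes(const float* x, std::size_t n)
{
    float lane[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (std::size_t i = 0; i < n; i += 4)
        for (std::size_t j = 0; j < 4; j++)
            lane[j] += x[i + j];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

For the input { 1.0e8f, 1.0f, 1.0f, 1.0f, -1.0e8f, 1.0f, 1.0f, 1.0f }, a typical IEEE-754 single-precision implementation returns 3.0 from SumSequential and 6.0 from SumFourLanes. The first three additions of 1.0 are lost to rounding in the sequential version because adjacent single-precision values near 1.0e8 are 8 apart. Neither function is wrong; they simply evaluate the same mathematical sum in a different order.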


CHAPTER 10 


AVX2 Programming - Packed 
Integers 


In Chapter 7, you learned how to use the AVX instruction set to perform packed integer operations using
128-bit wide operands and the XMM register set. In this chapter, you learn how to carry out similar operations using
AVX2 instructions with 256-bit wide operands and the YMM register set. Chapter 10's source code examples
are divided into two major sections. The first section contains elementary examples that illustrate basic 
operations using AVX2 instructions and 256-bit wide packed integer operands. The second section includes 
examples that are a continuation of the image processing techniques first presented in Chapter 7. 

All of the source code examples in this chapter require a processor and operating system that
support AVX2. You can use one of the free utilities listed in Appendix A to verify the processing capabilities
of your system. 


Packed Integer Fundamentals 


In this section, you learn how to perform fundamental packed integer operations using AVX2 instructions. 
The first source code example expounds basic arithmetic using 256-bit wide operands and the YMM 
register set. The second source code example demonstrates AVX2 instructions that carry out integer pack 
and unpack operations. This example also explains how to return a structure by value from an assembly 
language function. The final source code example illuminates AVX2 instructions that execute packed integer 
size promotions using zero or sign extended values. 


Basic Arithmetic 


Listing 10-1 shows the source code for example Ch10_01. This example illustrates how to perform basic
arithmetic operations using packed word and doubleword operands. 


Listing 10-1. Example Ch10_01

#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include "YmmVal.h"

using namespace std;

extern "C" void Avx2PackedMathI16_(const YmmVal& a, const YmmVal& b, YmmVal c[6]);
extern "C" void Avx2PackedMathI32_(const YmmVal& a, const YmmVal& b, YmmVal c[6]);

void Avx2PackedMathI16(void)
{
    alignas(32) YmmVal a;
    alignas(32) YmmVal b;
    alignas(32) YmmVal c[6];

    a.m_I16[0] = 10;        b.m_I16[0] = 1000;
    a.m_I16[1] = 20;        b.m_I16[1] = 2000;
    a.m_I16[2] = 3000;      b.m_I16[2] = 30;
    a.m_I16[3] = 4000;      b.m_I16[3] = 40;
    a.m_I16[4] = 30000;     b.m_I16[4] = 3000;      // add overflow
    a.m_I16[5] = 6000;      b.m_I16[5] = 32000;     // add overflow
    a.m_I16[6] = 2000;      b.m_I16[6] = -31000;    // sub overflow
    a.m_I16[7] = 4000;      b.m_I16[7] = -30000;    // sub overflow
    a.m_I16[8] = 4000;      b.m_I16[8] = -2500;
    a.m_I16[9] = 3600;      b.m_I16[9] = -1200;
    a.m_I16[10] = 6000;     b.m_I16[10] = 9000;
    a.m_I16[11] = -20000;   b.m_I16[11] = -20000;
    a.m_I16[12] = -25000;   b.m_I16[12] = -27000;   // add overflow
    a.m_I16[13] = 8000;     b.m_I16[13] = 28700;    // add overflow
    a.m_I16[14] = 3;        b.m_I16[14] = -32766;   // sub overflow
    a.m_I16[15] = -15000;   b.m_I16[15] = 24000;    // sub overflow

    Avx2PackedMathI16_(a, b, c);

    cout << "\nResults for Avx2PackedMathI16\n\n";
    cout << " i        a        b   vpaddw  vpaddsw   vpsubw  vpsubsw  vpminsw  vpmaxsw\n";
    cout << "--------------------------------------------------------------------------\n";

    for (int i = 0; i < 16; i++)
    {
        cout << setw(2) << i << ' ';
        cout << setw(8) << a.m_I16[i] << ' ';
        cout << setw(8) << b.m_I16[i] << ' ';
        cout << setw(8) << c[0].m_I16[i] << ' ';
        cout << setw(8) << c[1].m_I16[i] << ' ';
        cout << setw(8) << c[2].m_I16[i] << ' ';
        cout << setw(8) << c[3].m_I16[i] << ' ';
        cout << setw(8) << c[4].m_I16[i] << ' ';
        cout << setw(8) << c[5].m_I16[i] << '\n';
    }
}

void Avx2PackedMathI32(void)
{
    alignas(32) YmmVal a;
    alignas(32) YmmVal b;
    alignas(32) YmmVal c[6];

    a.m_I32[0] = 64;        b.m_I32[0] = 4;
    a.m_I32[1] = 1024;      b.m_I32[1] = 5;
    a.m_I32[2] = -2048;     b.m_I32[2] = 2;
    a.m_I32[3] = 8192;      b.m_I32[3] = 5;
    a.m_I32[4] = -256;      b.m_I32[4] = 8;
    a.m_I32[5] = 4096;      b.m_I32[5] = 7;
    a.m_I32[6] = 16;        b.m_I32[6] = 3;
    a.m_I32[7] = 512;       b.m_I32[7] = 6;

    Avx2PackedMathI32_(a, b, c);

    cout << "\nResults for Avx2PackedMathI32\n\n";
    cout << " i      a      b   vpaddd   vpsubd  vpmulld  vpsllvd  vpsravd   vpabsd\n";
    cout << "----------------------------------------------------------------------\n";

    for (int i = 0; i < 8; i++)
    {
        cout << setw(2) << i << ' ';
        cout << setw(6) << a.m_I32[i] << ' ';
        cout << setw(6) << b.m_I32[i] << ' ';
        cout << setw(8) << c[0].m_I32[i] << ' ';
        cout << setw(8) << c[1].m_I32[i] << ' ';
        cout << setw(8) << c[2].m_I32[i] << ' ';
        cout << setw(8) << c[3].m_I32[i] << ' ';
        cout << setw(8) << c[4].m_I32[i] << ' ';
        cout << setw(8) << c[5].m_I32[i] << '\n';
    }
}

int main()
{
    Avx2PackedMathI16();
    Avx2PackedMathI32();
    return 0;
}

;-------------------------------------------------
;               Ch10_01.asm
;-------------------------------------------------

; extern "C" void Avx2PackedMathI16_(const YmmVal& a, const YmmVal& b, YmmVal c[6])

        .code
Avx2PackedMathI16_ proc

; Load values a and b, which must be properly aligned
        vmovdqa ymm0,ymmword ptr [rcx]      ;ymm0 = a
        vmovdqa ymm1,ymmword ptr [rdx]      ;ymm1 = b

; Perform packed arithmetic operations
        vpaddw ymm2,ymm0,ymm1               ;add
        vmovdqa ymmword ptr [r8],ymm2       ;save vpaddw result

        vpaddsw ymm2,ymm0,ymm1              ;add with signed saturation
        vmovdqa ymmword ptr [r8+32],ymm2    ;save vpaddsw result

        vpsubw ymm2,ymm0,ymm1               ;sub
        vmovdqa ymmword ptr [r8+64],ymm2    ;save vpsubw result

        vpsubsw ymm2,ymm0,ymm1              ;sub with signed saturation
        vmovdqa ymmword ptr [r8+96],ymm2    ;save vpsubsw result

        vpminsw ymm2,ymm0,ymm1              ;signed minimums
        vmovdqa ymmword ptr [r8+128],ymm2   ;save vpminsw result

        vpmaxsw ymm2,ymm0,ymm1              ;signed maximums
        vmovdqa ymmword ptr [r8+160],ymm2   ;save vpmaxsw result

        vzeroupper
        ret
Avx2PackedMathI16_ endp

; extern "C" void Avx2PackedMathI32_(const YmmVal& a, const YmmVal& b, YmmVal c[6])

Avx2PackedMathI32_ proc

; Load values a and b, which must be properly aligned
        vmovdqa ymm0,ymmword ptr [rcx]      ;ymm0 = a
        vmovdqa ymm1,ymmword ptr [rdx]      ;ymm1 = b

; Perform packed arithmetic operations
        vpaddd ymm2,ymm0,ymm1               ;add
        vmovdqa ymmword ptr [r8],ymm2       ;save vpaddd result

        vpsubd ymm2,ymm0,ymm1               ;sub
        vmovdqa ymmword ptr [r8+32],ymm2    ;save vpsubd result

        vpmulld ymm2,ymm0,ymm1              ;signed mul (low 32 bits)
        vmovdqa ymmword ptr [r8+64],ymm2    ;save vpmulld result

        vpsllvd ymm2,ymm0,ymm1              ;shift left logical
        vmovdqa ymmword ptr [r8+96],ymm2    ;save vpsllvd result

        vpsravd ymm2,ymm0,ymm1              ;shift right arithmetic
        vmovdqa ymmword ptr [r8+128],ymm2   ;save vpsravd result

        vpabsd ymm2,ymm0                    ;absolute value
        vmovdqa ymmword ptr [r8+160],ymm2   ;save vpabsd result

        vzeroupper
        ret
Avx2PackedMathI32_ endp
        end


The C++ function Avx2PackedMathI16 contains code that demonstrates packed signed word arithmetic.
This function begins with the definitions of YmmVal variables a, b, and c. Note that the C++ specifier
alignas(32) is used with each YmmVal definition to ensure alignment on a 32-byte boundary. The signed
word elements of both a and b are then initialized with test values. Following variable initialization,
Avx2PackedMathI16 calls the assembly language function Avx2PackedMathI16_, which performs several
packed arithmetic operations. The results are then streamed to cout. The C++ function Avx2PackedMathI32
is next. The structure of this function is similar to Avx2PackedMathI16, with the main difference being that it
exercises packed doubleword operands.

The assembly language function Avx2PackedMathI16_ begins with a vmovdqa ymm0,ymmword ptr
[rcx] instruction that loads YmmVal a into register YMM0. The ensuing vmovdqa ymm1,ymmword ptr [rdx]
instruction loads YmmVal b into register YMM1. This is followed by a vpaddw ymm2,ymm0,ymm1 instruction
that performs packed word addition of a and b. The vmovdqa ymmword ptr [r8],ymm2 instruction then saves
the packed word sums to c[0]. The remaining assembly language code in Avx2PackedMathI16_ exercises the
instructions vpaddsw, vpsubw, vpsubsw, vpminsw, and vpmaxsw to carry out additional arithmetic operations.
Similar to the source code examples that you saw in Chapter 9, Avx2PackedMathI16_ uses a vzeroupper
instruction before its ret instruction. This avoids potential performance penalties that can occur when the
processor transitions from executing x86-AVX instructions to x86-SSE instructions, as explained in Chapter 8.
The assembly language function Avx2PackedMathI32_ employs a similar structure to exercise commonly-used
packed doubleword instructions including vpaddd, vpsubd, vpmulld, vpsllvd, vpsravd, and vpabsd. Here are
the results for source code example Ch10_01:


Results for Avx2PackedMathI16

 i      a      b   vpaddw  vpaddsw   vpsubw  vpsubsw  vpminsw  vpmaxsw
----------------------------------------------------------------------
 0     10   1000     1010     1010     -990     -990       10     1000
 1     20   2000     2020     2020    -1980    -1980       20     2000
 2   3000     30     3030     3030     2970     2970       30     3000
 3   4000     40     4040     4040     3960     3960       40     4000
 4  30000   3000   -32536    32767    27000    27000     3000    30000
 5   6000  32000   -27536    32767   -26000   -26000     6000    32000
 6   2000 -31000   -29000   -29000   -32536    32767   -31000     2000
 7   4000 -30000   -26000   -26000   -31536    32767   -30000     4000
 8   4000  -2500     1500     1500     6500     6500    -2500     4000
 9   3600  -1200     2400     2400     4800     4800    -1200     3600
10   6000   9000    15000    15000    -3000    -3000     6000     9000
11 -20000 -20000    25536   -32768        0        0   -20000   -20000
12 -25000 -27000    13536   -32768     2000     2000   -27000   -25000
13   8000  28700   -28836    32767   -20700   -20700     8000    28700
14      3 -32766   -32763   -32763   -32767    32767   -32766        3
15 -15000  24000     9000     9000    26536   -32768   -15000    24000

Results for Avx2PackedMathI32

 i      a      b   vpaddd   vpsubd  vpmulld  vpsllvd  vpsravd   vpabsd
----------------------------------------------------------------------
 0     64      4       68       60      256     1024        4       64
 1   1024      5     1029     1019     5120    32768       32     1024
 2  -2048      2    -2046    -2050    -4096    -8192     -512     2048
 3   8192      5     8197     8187    40960   262144      256     8192
 4   -256      8     -248     -264    -2048   -65536       -1      256
 5   4096      7     4103     4089    28672   524288       32     4096
 6     16      3       19       13       48      128        2       16
 7    512      6      518      506     3072    32768        8      512


On systems that support AVX2, most of the instructions exercised in this example can be used with a
variety of 256-bit wide packed integer operands. For example, the vpadd[b|q] and vpsub[b|q] instructions
carry out addition and subtraction using 256-bit wide packed byte or quadword operands. The vpaddsb
and vpsubsb instructions perform signed saturated addition and subtraction using packed byte operands.
The instructions vpmins[b|d] and vpmaxs[b|d] calculate packed signed minimums and maximums,
respectively. The variable bit shift instructions vpsllv[d|q], vpsravd, and vpsrlv[d|q] are new AVX2
instructions. These instructions are not available on systems that only support AVX.
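
The difference between the wraparound and saturated columns in the results above, and the per-element shift counts used by vpsllvd and vpsravd, can be modeled one element at a time in ordinary C++. The following is a minimal scalar sketch rather than code from the example; the function names AddWrapI16, AddSatI16, ShiftLeftVarI32, and ShiftRightArithVarI32 are hypothetical and were chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// One 16-bit element of vpaddw: the sum wraps modulo 2^16.
// (The final narrowing is modular on mainstream compilers and by rule since C++20.)
inline int16_t AddWrapI16(int16_t a, int16_t b)
{
    uint16_t sum = static_cast<uint16_t>(static_cast<uint16_t>(a) + static_cast<uint16_t>(b));
    return static_cast<int16_t>(sum);
}

// One 16-bit element of vpaddsw: the sum is clamped to [-32768, 32767].
inline int16_t AddSatI16(int16_t a, int16_t b)
{
    int32_t sum = static_cast<int32_t>(a) + static_cast<int32_t>(b);
    if (sum > INT16_MAX) sum = INT16_MAX;
    if (sum < INT16_MIN) sum = INT16_MIN;
    return static_cast<int16_t>(sum);
}

// One 32-bit element of vpsllvd: counts above 31 produce zero.
inline int32_t ShiftLeftVarI32(int32_t a, uint32_t count)
{
    if (count > 31)
        return 0;
    return static_cast<int32_t>(static_cast<uint32_t>(a) << count);
}

// One 32-bit element of vpsravd: counts above 31 fill the element with the sign bit.
// The ~(~a >> count) form keeps the shift on a non-negative value.
inline int32_t ShiftRightArithVarI32(int32_t a, uint32_t count)
{
    if (count > 31)
        count = 31;
    return (a < 0) ? ~(~a >> count) : (a >> count);
}
```

For instance, AddWrapI16(30000, 3000) and AddSatI16(30000, 3000) correspond to the vpaddw and vpaddsw entries in row 4 of the word results, and ShiftRightArithVarI32(-256, 8) corresponds to the vpsravd entry in row 4 of the doubleword results.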


Pack and Unpack 


The next source code example illustrates how to perform integer pack and unpack operations. These
operations are often employed to size-reduce or size-promote packed integer operands. This example also
explains how to return a structure by value from an assembly language function. Listing 10-2 shows the
source code for example Ch10_02.


Listing 10-2. Example Ch10_02


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include "YmmVal.h"

using namespace std;

struct alignas(32) YmmVal2
{
    YmmVal m_YmmVal0;
    YmmVal m_YmmVal1;
};

extern "C" YmmVal2 Avx2UnpackU32_U64_(const YmmVal& a, const YmmVal& b);
extern "C" void Avx2PackI32_I16_(const YmmVal& a, const YmmVal& b, YmmVal* c);

void Avx2UnpackU32_U64(void)
{
    alignas(32) YmmVal a;
    alignas(32) YmmVal b;

    a.m_U32[0] = 0x00000000; b.m_U32[0] = 0x88888888;
    a.m_U32[1] = 0x11111111; b.m_U32[1] = 0x99999999;
    a.m_U32[2] = 0x22222222; b.m_U32[2] = 0xaaaaaaaa;
    a.m_U32[3] = 0x33333333; b.m_U32[3] = 0xbbbbbbbb;
    a.m_U32[4] = 0x44444444; b.m_U32[4] = 0xcccccccc;
    a.m_U32[5] = 0x55555555; b.m_U32[5] = 0xdddddddd;
    a.m_U32[6] = 0x66666666; b.m_U32[6] = 0xeeeeeeee;
    a.m_U32[7] = 0x77777777; b.m_U32[7] = 0xffffffff;

    YmmVal2 c = Avx2UnpackU32_U64_(a, b);

    cout << "\nResults for Avx2UnpackU32_U64\n\n";
    cout << "a lo " << a.ToStringX32(0) << '\n';
    cout << "b lo " << b.ToStringX32(0) << '\n';
    cout << '\n';
    cout << "a hi " << a.ToStringX32(1) << '\n';
    cout << "b hi " << b.ToStringX32(1) << '\n';

    cout << "\nvpunpckldq result\n";
    cout << "c.m_YmmVal0 lo " << c.m_YmmVal0.ToStringX64(0) << '\n';
    cout << "c.m_YmmVal0 hi " << c.m_YmmVal0.ToStringX64(1) << '\n';

    cout << "\nvpunpckhdq result\n";
    cout << "c.m_YmmVal1 lo " << c.m_YmmVal1.ToStringX64(0) << '\n';
    cout << "c.m_YmmVal1 hi " << c.m_YmmVal1.ToStringX64(1) << '\n';
}

void Avx2PackI32_I16(void)
{
    alignas(32) YmmVal a;
    alignas(32) YmmVal b;
    alignas(32) YmmVal c;

    a.m_I32[0] = 10;        b.m_I32[0] = 32768;
    a.m_I32[1] = -200000;   b.m_I32[1] = 6500;
    a.m_I32[2] = 300000;    b.m_I32[2] = 42000;
    a.m_I32[3] = -4000;     b.m_I32[3] = -68000;
    a.m_I32[4] = 9000;      b.m_I32[4] = 25000;
    a.m_I32[5] = 80000;     b.m_I32[5] = 500000;
    a.m_I32[6] = 200;       b.m_I32[6] = -7000;
    a.m_I32[7] = -32769;    b.m_I32[7] = 12500;

    Avx2PackI32_I16_(a, b, &c);

    cout << "\nResults for Avx2PackI32_I16\n\n";

    cout << "a lo " << a.ToStringI32(0) << '\n';
    cout << "a hi " << a.ToStringI32(1) << '\n';
    cout << '\n';

    cout << "b lo " << b.ToStringI32(0) << '\n';
    cout << "b hi " << b.ToStringI32(1) << '\n';
    cout << '\n';

    cout << "c lo " << c.ToStringI16(0) << '\n';
    cout << "c hi " << c.ToStringI16(1) << '\n';
    cout << '\n';
}

int main()
{
    Avx2UnpackU32_U64();
    Avx2PackI32_I16();
    return 0;
}
;-------------------------------------------------
;               Ch10_02.asm
;-------------------------------------------------

; extern "C" YmmVal2 Avx2UnpackU32_U64_(const YmmVal& a, const YmmVal& b);

        .code
Avx2UnpackU32_U64_ proc

; Load argument values
        vmovdqa ymm0,ymmword ptr [rdx]      ;ymm0 = a
        vmovdqa ymm1,ymmword ptr [r8]       ;ymm1 = b

; Perform dword to qword unpacks
        vpunpckldq ymm2,ymm0,ymm1           ;unpack low doublewords
        vpunpckhdq ymm3,ymm0,ymm1           ;unpack high doublewords

; Save result to YmmVal2 buffer
        vmovdqa ymmword ptr [rcx],ymm2      ;save low result
        vmovdqa ymmword ptr [rcx+32],ymm3   ;save high result

        mov rax,rcx                         ;rax = ptr to YmmVal2
        vzeroupper
        ret
Avx2UnpackU32_U64_ endp

; extern "C" void Avx2PackI32_I16_(const YmmVal& a, const YmmVal& b, YmmVal* c);

Avx2PackI32_I16_ proc

; Load argument values
        vmovdqa ymm0,ymmword ptr [rcx]      ;ymm0 = a
        vmovdqa ymm1,ymmword ptr [rdx]      ;ymm1 = b

; Perform pack dword to word with signed saturation
        vpackssdw ymm2,ymm0,ymm1            ;ymm2 = packed words
        vmovdqa ymmword ptr [r8],ymm2       ;save result

        vzeroupper
        ret
Avx2PackI32_I16_ endp

Foo1_ proc
        ret
Foo1_ endp
        end


The C++ code in Listing 10-2 begins with the declaration of a structure named YmmVal2. This structure
contains two YmmVal members: m_YmmVal0 and m_YmmVal1. Note that the alignas(32) specifier is used
immediately after the keyword struct. Using this specifier ensures that all instances of YmmVal2 are aligned
on a 32-byte boundary, including temporary instances created by the compiler. More on this in a moment.
The assembly language function Avx2UnpackU32_U64_, whose declaration follows, returns an instance of
YmmVal2 by value.

The C++ function Avx2UnpackU32_U64 begins by initializing the unsigned doubleword elements of
YmmVal variables a and b. Following variable initialization is the statement YmmVal2 c = Avx2UnpackU32_
U64_(a, b), which calls the assembly language function Avx2UnpackU32_U64_ to unpack the elements of a
and b from doublewords to quadwords. Unlike previous examples, Avx2UnpackU32_U64_ returns its YmmVal2
result by value. Before proceeding, it is important to note that in most cases, returning a user-defined
structure like YmmVal2 by value is less efficient than passing a pointer argument to a variable of type YmmVal2.
The function Avx2UnpackU32_U64_ uses return-by-value principally for demonstration purposes and to
elucidate the Visual C++ calling convention protocols that an assembly language function must observe
when returning a structure by value is warranted. The remaining statements in Avx2UnpackU32_U64 stream
the results from Avx2UnpackU32_U64_ to cout.

Following Avx2UnpackU32_U64 is the C++ function Avx2PackI32_I16. This function initializes the
signed doubleword elements of YmmVal variables a and b. These values will be size-reduced to packed
words. Subsequent to YmmVal variable initialization, Avx2PackI32_I16 calls the assembly language function
Avx2PackI32_I16_ to carry out the aforementioned size reduction. The results are then streamed to cout.

The calling convention that Visual C++ uses for functions that return a structure by value varies
somewhat from the normal calling convention. Upon entry to the assembly language function
Avx2UnpackU32_U64_, register RCX points to a temporary buffer where Avx2UnpackU32_U64_ must store its
YmmVal2 return result. It is important to note that this buffer is not necessarily the same memory location
as the destination YmmVal2 variable in the C++ statement that called Avx2UnpackU32_U64_. In order to
implement expression evaluation and operator overloading, a C++ compiler often generates code that
allocates temporary variables (or rvalues) to hold intermediate results. An rvalue that needs to be saved is
ultimately copied to a named variable (or lvalue) using either a default or overloaded assignment operator.
This copy operation is the reason why returning a structure by value is usually slower than passing a pointer
argument. The alignas(32) specifier that's used in the declaration of struct YmmVal2 directs the Visual C++
compiler to align all variables of type YmmVal2, including rvalues, on a 32-byte boundary.

If the subject matter of the preceding paragraph seems a little abstract, don't worry. Temporary storage
space allocation for return-by-value structures is handled automatically by the C++ compiler. It's more
important to understand the following Visual C++ calling convention requirements that must be observed by
any function that returns a large structure (any structure whose size is greater than eight bytes) by value:


• The caller of a function that returns a large structure by value must allocate storage
space for the returned structure. A pointer to this storage space must be passed to
the called function in register RCX.


• The normal calling convention argument registers are "right-shifted" by one. This
means that the first three arguments are passed using registers RDX/XMM1, R8/
XMM2, and R9/XMM3. Any remaining arguments are passed on the stack.


• Prior to returning, the called function must load register RAX with a pointer to the
returned structure.


If the size of a return-by-value structure is less than or equal to eight bytes, it must be returned in
register RAX. The normal calling convention argument registers are used in these situations.
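
Viewed from C++, the requirements listed above resemble a function that takes an explicit result pointer as its first argument and returns that same pointer. The sketch below is only a source-level analogy under that assumption, not compiler output; the type Pair256 and the functions MakePairByValue and MakePairExplicit are hypothetical stand-ins for YmmVal2 and Avx2UnpackU32_U64_.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-byte stand-in for YmmVal2 (two 32-byte halves).
struct alignas(32) Pair256
{
    uint64_t m_Lo[4];
    uint64_t m_Hi[4];
};

// What the C++ programmer writes: the result is returned by value.
Pair256 MakePairByValue(uint64_t lo, uint64_t hi)
{
    Pair256 r;
    for (int i = 0; i < 4; i++)
    {
        r.m_Lo[i] = lo;
        r.m_Hi[i] = hi;
    }
    return r;
}

// Roughly what the called function sees under the convention described above:
// 'ret' plays the role of the caller-allocated buffer passed in RCX, the real
// arguments are shifted over by one position, and the buffer pointer is handed
// back to the caller (the role of RAX).
Pair256* MakePairExplicit(Pair256* ret, uint64_t lo, uint64_t hi)
{
    for (int i = 0; i < 4; i++)
    {
        ret->m_Lo[i] = lo;
        ret->m_Hi[i] = hi;
    }
    return ret;
}
```

Both functions produce the same structure contents. The explicit form simply makes the hidden buffer pointer visible, which is the view an assembly language function has of a return-by-value call.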

Returning to the code, the first instruction of function Avx2UnpackU32_U64_ uses a vmovdqa
ymm0,ymmword ptr [rdx] instruction to load YmmVal a (the first function argument) into register
YMM0. The ensuing vmovdqa ymm1,ymmword ptr [r8] instruction loads YmmVal b (the second function
argument) into register YMM1. The next two instructions, vpunpckldq ymm2,ymm0,ymm1 and vpunpckhdq
ymm3,ymm0,ymm1, unpack the doublewords into quadwords, as shown in Figure 10-1. The results are then
saved to the YmmVal2 buffer pointed to by RCX using two vmovdqa instructions. Note that two vmovdqu
instructions would be required here if the structure YmmVal2 was declared without the alignas(32) specifier.
As previously mentioned, the Visual C++ calling convention requires any function that returns a structure
by value to load a copy of the structure buffer pointer into register RAX prior to returning. The mov rax,rcx
instruction fulfills this requirement (recall that RCX contains a pointer to the structure buffer).


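Initial values (doubleword elements, most significant on the left; | separates the 128-bit lanes)

ymm0 (a):  77777777h 66666666h 55555555h 44444444h | 33333333h 22222222h 11111111h 00000000h
ymm1 (b):  FFFFFFFFh EEEEEEEEh DDDDDDDDh CCCCCCCCh | BBBBBBBBh AAAAAAAAh 99999999h 88888888h

vpunpckldq ymm2,ymm0,ymm1

ymm2:      DDDDDDDD55555555h CCCCCCCC44444444h | 9999999911111111h 8888888800000000h

vpunpckhdq ymm3,ymm0,ymm1

ymm3:      FFFFFFFF77777777h EEEEEEEE66666666h | BBBBBBBB33333333h AAAAAAAA22222222h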

Figure 10-1. Execution of the vpunpckldq and vpunpckhdq instructions 
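
The lane-wise interleaving in Figure 10-1 can also be expressed as a scalar model. The following is a minimal sketch, assuming the registers are represented as std::array values with element 0 in the least significant position; the aliases U32x8 and U64x4 and the function UnpackDq are hypothetical names introduced only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using U32x8 = std::array<uint32_t, 8>;   // eight doublewords, element 0 = bits 31:0
using U64x4 = std::array<uint64_t, 4>;   // four quadwords, element 0 = bits 63:0

// Scalar model of vpunpckldq (high == false) and vpunpckhdq (high == true).
// Each 128-bit lane is processed independently: the low (or high) two
// doublewords of a and b in that lane are interleaved, with the element from
// a in the lower half of each result quadword and the element from b in the
// upper half.
U64x4 UnpackDq(const U32x8& a, const U32x8& b, bool high)
{
    U64x4 r{};

    for (int lane = 0; lane < 2; lane++)
    {
        int base = lane * 4 + (high ? 2 : 0);

        for (int j = 0; j < 2; j++)
            r[lane * 2 + j] = (static_cast<uint64_t>(b[base + j]) << 32) | a[base + j];
    }

    return r;
}
```

Calling UnpackDq with the test values from Listing 10-2 reproduces the quadwords shown for ymm2 and ymm3 in the figure. Note that the low lane of the vpunpckhdq result comes from doublewords 2 and 3, not 4 and 5, which is the practical consequence of the independent 128-bit lanes.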


The assembly language function Avx2PackI32_I16_ demonstrates use of the vpackssdw (Pack with
Signed Saturation) instruction. In this function, the vpackssdw ymm2,ymm0,ymm1 instruction converts the 16
doubleword integers in registers YMM0 and YMM1 to word integers using signed saturation. It then saves
the 16 word integers in register YMM2. Figure 10-2 illustrates the execution of this instruction. X86-AVX also
includes a vpacksswb instruction that performs signed word to byte size reductions. The vpackus[dw|wb]
instructions can be used for packed unsigned integer reductions.
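
The saturation and element ordering performed by vpackssdw can be sketched in scalar C++ as follows. This is an illustrative model under the assumption that element 0 of each array is the least significant element of the register; the names I32x8, I16x16, SatI32ToI16, and PackSsDw are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using I32x8 = std::array<int32_t, 8>;     // eight signed doublewords
using I16x16 = std::array<int16_t, 16>;   // sixteen signed words

// Clamp a signed doubleword to the signed word range [-32768, 32767].
inline int16_t SatI32ToI16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return static_cast<int16_t>(x);
}

// Scalar model of vpackssdw ymm2,ymm0,ymm1 with a = ymm0 and b = ymm1.
// Within each 128-bit lane, the four saturated words from a occupy the low
// half of the lane and the four saturated words from b occupy the high half.
I16x16 PackSsDw(const I32x8& a, const I32x8& b)
{
    I16x16 r{};

    for (int lane = 0; lane < 2; lane++)
    {
        for (int j = 0; j < 4; j++)
        {
            r[lane * 8 + j] = SatI32ToI16(a[lane * 4 + j]);
            r[lane * 8 + 4 + j] = SatI32ToI16(b[lane * 4 + j]);
        }
    }

    return r;
}
```

With the test values from Listing 10-2, the first eight elements of the result correspond to the "c lo" output line and the last eight to the "c hi" line. Values such as 300000 and -68000 are clamped to 32767 and -32768 rather than truncated.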


Initial values:  ymm0 = a, ymm1 = b (eight doublewords each)

vpackssdw ymm2,ymm0,ymm1

ymm2 (16 saturated words):

  bits 255:192         bits 191:128         bits 127:64          bits 63:0
  from ymm1[255:128]   from ymm0[255:128]   from ymm1[127:0]     from ymm0[127:0]


Figure 10-2. Execution of the vpackssdw instruction


Note that in Figures 10-1 and 10-2, the vpunpckldq, vpunpckhdq, and vpackssdw instructions carry out
their operations using two 128-bit wide independent lanes, as explained in Chapter 4. Here are the results
for source code example Ch10_02:


Results for Avx2UnpackU32_U64

a lo           00000000        11111111   |        22222222        33333333
b lo           88888888        99999999   |        AAAAAAAA        BBBBBBBB

a hi           44444444        55555555   |        66666666        77777777
b hi           CCCCCCCC        DDDDDDDD   |        EEEEEEEE        FFFFFFFF

vpunpckldq result
c.m_YmmVal0 lo         8888888800000000   |        9999999911111111
c.m_YmmVal0 hi         CCCCCCCC44444444   |        DDDDDDDD55555555

vpunpckhdq result
c.m_YmmVal1 lo         AAAAAAAA22222222   |        BBBBBBBB33333333
c.m_YmmVal1 hi         EEEEEEEE66666666   |        FFFFFFFF77777777

Results for Avx2PackI32_I16

a lo       10  -200000 |   300000    -4000
a hi     9000    80000 |      200   -32769

b lo    32768     6500 |    42000   -68000
b hi    25000   500000 |    -7000    12500

c lo       10   -32768    32767    -4000 |    32767     6500    32767   -32768
c hi     9000    32767      200   -32768 |    25000    32767    -7000    12500


Size Promotions


In Chapter 7, you learned how to use the vpunpckl[bw|wd] and vpunpckh[bw|wd] instructions to
size-promote packed integers (see source code examples Ch07_05, Ch07_06, and Ch07_08). The next source
code example, Ch10_03, demonstrates how to employ the vpmovzx[bw|bd] and vpmovsx[wd|wq] instructions
to size-promote packed integers using either zero or sign extension. Listing 10-3 shows the source code for
example Ch10_03.


Listing 10-3. Example Ch10_03


#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <string>
#include "YmmVal.h"

using namespace std;

extern "C" void Avx2ZeroExtU8_U16_(YmmVal*a, YmmVal b[2]);
extern "C" void Avx2ZeroExtU8_U32_(YmmVal*a, YmmVal b[4]);
extern "C" void Avx2SignExtI16_I32_(YmmVal*a, YmmVal b[2]);
extern "C" void Avx2SignExtI16_I64_(YmmVal*a, YmmVal b[4]);

const string c_Line(80, '-');

void Avx2ZeroExtU8_U16(void)
{
    alignas(32) YmmVal a;
    alignas(32) YmmVal b[2];

    for (int i = 0; i < 32; i++)
        a.m_U8[i] = (uint8_t)(i * 8);

    Avx2ZeroExtU8_U16_(&a, b);

    cout << "\nResults for Avx2ZeroExtU8_U16_\n";
    cout << c_Line << '\n';

    cout << "a (0:15):  " << a.ToStringU8(0) << '\n';
    cout << "a (16:31): " << a.ToStringU8(1) << '\n';
    cout << '\n';
    cout << "b (0:7):   " << b[0].ToStringU16(0) << '\n';
    cout << "b (8:15):  " << b[0].ToStringU16(1) << '\n';
    cout << "b (16:23): " << b[1].ToStringU16(0) << '\n';
    cout << "b (24:31): " << b[1].ToStringU16(1) << '\n';
}

void Avx2ZeroExtU8_U32(void)
{
    alignas(32) YmmVal a;
    alignas(32) YmmVal b[4];

    for (int i = 0; i < 32; i++)
        a.m_U8[i] = (uint8_t)(255 - i * 8);

    Avx2ZeroExtU8_U32_(&a, b);

    cout << "\nResults for Avx2ZeroExtU8_U32_\n";
    cout << c_Line << '\n';

    cout << "a (0:15):  " << a.ToStringU8(0) << '\n';
    cout << "a (16:31): " << a.ToStringU8(1) << '\n';
    cout << '\n';
    cout << "b (0:3):   " << b[0].ToStringU32(0) << '\n';
    cout << "b (4:7):   " << b[0].ToStringU32(1) << '\n';
    cout << "b (8:11):  " << b[1].ToStringU32(0) << '\n';
    cout << "b (12:15): " << b[1].ToStringU32(1) << '\n';
    cout << "b (16:19): " << b[2].ToStringU32(0) << '\n';
    cout << "b (20:23): " << b[2].ToStringU32(1) << '\n';
    cout << "b (24:27): " << b[3].ToStringU32(0) << '\n';
    cout << "b (28:31): " << b[3].ToStringU32(1) << '\n';
}

void Avx2SignExtI16_I32()
{
    alignas(32) YmmVal a;
    alignas(32) YmmVal b[2];

    for (int i = 0; i < 16; i++)
        a.m_I16[i] = (int16_t)(-32768 + i * 4000);

    Avx2SignExtI16_I32_(&a, b);

    cout << "\nResults for Avx2SignExtI16_I32_\n";
    cout << c_Line << '\n';

    cout << "a (0:7):   " << a.ToStringI16(0) << '\n';
    cout << "a (8:15):  " << a.ToStringI16(1) << '\n';
    cout << '\n';
    cout << "b (0:3):   " << b[0].ToStringI32(0) << '\n';
    cout << "b (4:7):   " << b[0].ToStringI32(1) << '\n';
    cout << "b (8:11):  " << b[1].ToStringI32(0) << '\n';
    cout << "b (12:15): " << b[1].ToStringI32(1) << '\n';
}

void Avx2SignExtI16_I64()
{
    alignas(32) YmmVal a;
    alignas(32) YmmVal b[4];

    for (int i = 0; i < 16; i++)
        a.m_I16[i] = (int16_t)(32767 - i * 4000);

    Avx2SignExtI16_I64_(&a, b);

    cout << "\nResults for Avx2SignExtI16_I64_\n";
    cout << c_Line << '\n';

    cout << "a (0:7):   " << a.ToStringI16(0) << '\n';
    cout << "a (8:15):  " << a.ToStringI16(1) << '\n';
    cout << '\n';
    cout << "b (0:1):   " << b[0].ToStringI64(0) << '\n';
    cout << "b (2:3):   " << b[0].ToStringI64(1) << '\n';
    cout << "b (4:5):   " << b[1].ToStringI64(0) << '\n';
    cout << "b (6:7):   " << b[1].ToStringI64(1) << '\n';
    cout << "b (8:9):   " << b[2].ToStringI64(0) << '\n';
    cout << "b (10:11): " << b[2].ToStringI64(1) << '\n';
    cout << "b (12:13): " << b[3].ToStringI64(0) << '\n';
    cout << "b (14:15): " << b[3].ToStringI64(1) << '\n';
}

int main()
{
    Avx2ZeroExtU8_U16();
    Avx2ZeroExtU8_U32();
    Avx2SignExtI16_I32();
    Avx2SignExtI16_I64();
    return 0;
}
;-------------------------------------------------
;               Ch10_03.asm
;-------------------------------------------------

; extern "C" void Avx2ZeroExtU8_U16_(YmmVal*a, YmmVal b[2]);

        .code
Avx2ZeroExtU8_U16_ proc
        vpmovzxbw ymm0,xmmword ptr [rcx]        ;zero extend a[0] - a[15]
        vpmovzxbw ymm1,xmmword ptr [rcx+16]     ;zero extend a[16] - a[31]

        vmovdqa ymmword ptr [rdx],ymm0          ;save results
        vmovdqa ymmword ptr [rdx+32],ymm1

        vzeroupper
        ret
Avx2ZeroExtU8_U16_ endp

; extern "C" void Avx2ZeroExtU8_U32_(YmmVal*a, YmmVal b[4]);

Avx2ZeroExtU8_U32_ proc
        vpmovzxbd ymm0,qword ptr [rcx]          ;zero extend a[0] - a[7]
        vpmovzxbd ymm1,qword ptr [rcx+8]        ;zero extend a[8] - a[15]
        vpmovzxbd ymm2,qword ptr [rcx+16]       ;zero extend a[16] - a[23]
        vpmovzxbd ymm3,qword ptr [rcx+24]       ;zero extend a[24] - a[31]

        vmovdqa ymmword ptr [rdx],ymm0          ;save results
        vmovdqa ymmword ptr [rdx+32],ymm1
        vmovdqa ymmword ptr [rdx+64],ymm2
        vmovdqa ymmword ptr [rdx+96],ymm3

        vzeroupper
        ret
Avx2ZeroExtU8_U32_ endp

; extern "C" void Avx2SignExtI16_I32_(YmmVal*a, YmmVal b[2])

Avx2SignExtI16_I32_ proc
        vpmovsxwd ymm0,xmmword ptr [rcx]        ;sign extend a[0] - a[7]
        vpmovsxwd ymm1,xmmword ptr [rcx+16]     ;sign extend a[8] - a[15]

        vmovdqa ymmword ptr [rdx],ymm0          ;save results
        vmovdqa ymmword ptr [rdx+32],ymm1

        vzeroupper
        ret
Avx2SignExtI16_I32_ endp

; extern "C" void Avx2SignExtI16_I64_(YmmVal*a, YmmVal b[4])

Avx2SignExtI16_I64_ proc
        vpmovsxwq ymm0,qword ptr [rcx]          ;sign extend a[0] - a[3]
        vpmovsxwq ymm1,qword ptr [rcx+8]        ;sign extend a[4] - a[7]
        vpmovsxwq ymm2,qword ptr [rcx+16]       ;sign extend a[8] - a[11]
        vpmovsxwq ymm3,qword ptr [rcx+24]       ;sign extend a[12] - a[15]

        vmovdqa ymmword ptr [rdx],ymm0          ;save results
        vmovdqa ymmword ptr [rdx+32],ymm1
        vmovdqa ymmword ptr [rdx+64],ymm2
        vmovdqa ymmword ptr [rdx+96],ymm3

        vzeroupper
        ret

Avx2SignExtI16_I64_ endp
        end

The C++ code in Listing 10-3 contains four functions that initialize test cases for various packed
size-promotion operations. The first function, Avx2ZeroExtU8_U16, begins by initializing the unsigned byte
elements of YmmVal a. It then calls the assembly language function Avx2ZeroExtU8_U16_ to size-promote the
packed unsigned bytes into packed unsigned words. The function Avx2ZeroExtU8_U32 performs a similar
set of initializations to demonstrate packed unsigned byte to packed unsigned doubleword promotions.
The functions Avx2SignExtI16_I32 and Avx2SignExtI16_I64 initialize test cases for packed signed word to
packed signed doubleword and packed signed quadword size promotions.

The first instruction in the assembly language function Avx2ZeroExtU8_U16_, vpmovzxbw ymm0,xmmword
ptr [rcx], loads and zero-extends the 16 low-order bytes of YmmVal a (pointed to by register RCX) and
saves these values in register YMM0. The ensuing vpmovzxbw ymm1,xmmword ptr [rcx+16] instruction
performs the same operation using the 16 high-order bytes of YmmVal a. The function Avx2ZeroExtU8_U16_
then uses two vmovdqa instructions to save the size-promoted results.

The assembly language function Avx2ZeroExtU8_U32_ performs packed byte to doubleword size
promotions. The first instruction, vpmovzxbd ymm0,qword ptr [rcx], loads and zero-extends the eight
low-order bytes of YmmVal a into doublewords and saves these values in register YMM0. The three
ensuing vpmovzxbd instructions size-promote the remaining byte values in YmmVal a. The results are then
saved using a series of vmovdqa instructions. When working with unsigned 8-bit values, it is sometimes
(depending on the algorithm) more expedient to use the vpmovzxbd instruction to perform a packed byte
to packed doubleword size promotion instead of a semantically equivalent series of vpunpckl[bw|wd] and
vpunpckh[bw|wd] instructions. You will see an example of this in Chapter 14.
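
In scalar C++, zero extension and sign extension correspond to widening an unsigned or a signed integer type, respectively. The following minimal sketch models one vpmovzxbd and one vpmovsxwq operation element by element; the function names ZeroExtU8ToU32 and SignExtI16ToI64 are hypothetical and are used here only to illustrate the semantics.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of vpmovzxbd: eight unsigned bytes become eight doublewords.
// Widening an unsigned type fills the upper 24 bits with zeros.
std::array<uint32_t, 8> ZeroExtU8ToU32(const uint8_t* src)
{
    std::array<uint32_t, 8> r{};

    for (int i = 0; i < 8; i++)
        r[i] = src[i];

    return r;
}

// Scalar model of vpmovsxwq: four signed words become four quadwords.
// Widening a signed type replicates the sign bit into the upper 48 bits.
std::array<int64_t, 4> SignExtI16ToI64(const int16_t* src)
{
    std::array<int64_t, 4> r{};

    for (int i = 0; i < 4; i++)
        r[i] = src[i];

    return r;
}
```

The choice between the two matters only for elements whose most significant bit is set: the byte value F7h zero-extends to 247, whereas the same bit pattern interpreted as a signed byte would sign-extend to -9.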

The assembly language functions Avx2SignExtI16_I32_ and Avx2SignExtI16_I64_ demonstrate how
to use the vpmovsxwd and vpmovsxwq instructions, respectively. These instructions size-promote and
sign-extend packed word integers to doublewords and quadwords. X86-AVX also includes the packed move with
sign extension instructions vpmovsx[bw|bd|bq] and vpmovsxdq. Here is the output for source code example
Ch10_03:


Results for Avx2ZeroExtU8_U16_

a (0:15):     0    8   16   24   32   40   48   56 |   64   72   80   88   96  104  112  120
a (16:31):  128  136  144  152  160  168  176  184 |  192  200  208  216  224  232  240  248

b (0:7):        0      8     16     24 |     32     40     48     56
b (8:15):      64     72     80     88 |     96    104    112    120
b (16:23):    128    136    144    152 |    160    168    176    184
b (24:31):    192    200    208    216 |    224    232    240    248

Results for Avx2ZeroExtU8_U32_

a (0:15):   255  247  239  231  223  215  207  199 |  191  183  175  167  159  151  143  135
a (16:31):  127  119  111  103   95   87   79   71 |   63   55   47   39   31   23   15    7

b (0:3):        255      247 |      239      231
b (4:7):        223      215 |      207      199
b (8:11):       191      183 |      175      167
b (12:15):      159      151 |      143      135
b (16:19):      127      119 |      111      103
b (20:23):       95       87 |       79       71
b (24:27):       63       55 |       47       39
b (28:31):       31       23 |       15        7

Results for Avx2SignExtI16_I32_

a (0:7):    -32768 -28768 -24768 -20768 | -16768 -12768  -8768  -4768
a (8:15):     -768   3232   7232  11232 |  15232  19232  23232  27232

b (0:3):     -32768   -28768 |   -24768   -20768
b (4:7):     -16768   -12768 |    -8768    -4768
b (8:11):      -768     3232 |     7232    11232
b (12:15):    15232    19232 |    23232    27232

Results for Avx2SignExtI16_I64_

a (0:7):     32767  28767  24767  20767 |  16767  12767   8767   4767
a (8:15):      767  -3233  -7233 -11233 | -15233 -19233 -23233 -27233

b (0:1):      32767 |    28767
b (2:3):      24767 |    20767
b (4:5):      16767 |    12767
b (6:7):       8767 |     4767
b (8:9):        767 |    -3233
b (10:11):    -7233 |   -11233
b (12:13):   -15233 |   -19233
b (14:15):   -23233 |   -27233


Packed Integer Image Processing 


In Chapter 7, you learned how to use the AVX instruction set to perform some common image processing 
operations using 128-bit wide packed unsigned integer operands. The source code examples of this section 
demonstrate additional image processing methods using AVX2 instructions with 256-bit wide packed
unsigned integer operands. The first source example illustrates how to clip the pixel intensity values of a 
grayscale image. This is followed by an example that determines the minimum and maximum pixel intensity 
values of an RGB image. The final source code example uses the AVX2 instruction set to perform RGB to 
grayscale image conversion. 


Pixel Clipping 


Pixel clipping is an image processing technique that bounds the intensity values of each pixel in an image 
between two threshold limits. This technique is often used to reduce the dynamic range of an image by 
eliminating its extremely dark and light pixels. Source code example Ch10_04 illustrates how to use the
AVX2 instruction set to clip the pixels of an 8-bit grayscale image. Listing 10-4 shows the C++ and assembly
language source code for example Ch10_04.
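
Bounding a pixel between two thresholds can be written without branches as min(max(pixel, lo), hi), which is the per-byte operation that the vpmaxub and vpminub instructions in the assembly language listing apply to 32 pixels at a time. The following scalar sketch illustrates the idea; the function ClipPixelsScalar is a hypothetical helper and is not part of example Ch10_04.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Clips each source pixel to the range [lo, hi] and returns the number of
// pixels whose value was changed by the clipping operation.
uint64_t ClipPixelsScalar(const uint8_t* src, uint8_t* des, size_t n, uint8_t lo, uint8_t hi)
{
    uint64_t num_clipped = 0;

    for (size_t i = 0; i < n; i++)
    {
        // max raises values below lo; min lowers values above hi
        uint8_t clipped = std::min(std::max(src[i], lo), hi);

        des[i] = clipped;
        num_clipped += (clipped != src[i]) ? 1 : 0;
    }

    return num_clipped;
}
```

Counting a pixel as clipped whenever the output differs from the input parallels the compare-and-count approach used by the assembly language code, which compares the clipped pixels against the originals and counts the mismatches.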


Listing 10-4. Example Ch10_04


#pragma once
#include <cstdint>

// The following structure must match the structure that's declared in the .asm file
struct ClipData
{
    uint8_t* m_Src;                 // source buffer pointer
    uint8_t* m_Des;                 // destination buffer pointer
    uint64_t m_NumPixels;           // number of pixels
    uint64_t m_NumClippedPixels;    // number of clipped pixels
    uint8_t m_ThreshLo;             // low threshold
    uint8_t m_ThreshHi;             // high threshold
};

// Functions defined in Ch10_04.cpp
extern void Init(uint8_t* x, uint64_t n, unsigned int seed);
extern bool Avx2ClipPixelsCpp(ClipData* cd);

// Functions defined in Ch10_04_.asm
extern "C" bool Avx2ClipPixels_(ClipData* cd);

// Functions defined in Ch10_04_BM.cpp
extern void Avx2ClipPixels_BM(void);

#include "stdafx.h"
#include <iostream>
#include <random>
#include <memory.h>
#include <limits>
#include "Ch10_04.h"
#include "AlignedMem.h"

using namespace std;

void Init(uint8_t* x, uint64_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {0, 255};
    default_random_engine rng(seed);

    for (size_t i = 0; i < n; i++)
        x[i] = (uint8_t)ui_dist(rng);
}

bool Avx2ClipPixelsCpp(ClipData* cd)
{
    uint8_t* src = cd->m_Src;
    uint8_t* des = cd->m_Des;
    uint64_t num_pixels = cd->m_NumPixels;

    if (num_pixels == 0 || (num_pixels % 32) != 0)
        return false;

    if (!AlignedMem::IsAligned(src, 32) || !AlignedMem::IsAligned(des, 32))
        return false;

    uint64_t num_clipped_pixels = 0;
    uint8_t thresh_lo = cd->m_ThreshLo;
    uint8_t thresh_hi = cd->m_ThreshHi;

    for (uint64_t i = 0; i < num_pixels; i++)
    {
        uint8_t pixel = src[i];

        if (pixel < thresh_lo)
        {
            des[i] = thresh_lo;
            num_clipped_pixels++;
        }
        else if (pixel > thresh_hi)
        {
            des[i] = thresh_hi;
            num_clipped_pixels++;
        }
        else
            des[i] = src[i];
    }

    cd->m_NumClippedPixels = num_clipped_pixels;
    return true;
}

void Avx2ClipPixels(void)
{
    const uint8_t thresh_lo = 10;
    const uint8_t thresh_hi = 245;
    const uint64_t num_pixels = 4 * 1024 * 1024;

    AlignedArray<uint8_t> src(num_pixels, 32);
    AlignedArray<uint8_t> des1(num_pixels, 32);
    AlignedArray<uint8_t> des2(num_pixels, 32);

    Init(src.Data(), num_pixels, 157);

    ClipData cd1;
    ClipData cd2;

    cd1.m_Src = src.Data();
    cd1.m_Des = des1.Data();
    cd1.m_NumPixels = num_pixels;
    cd1.m_NumClippedPixels = numeric_limits<uint64_t>::max();
    cd1.m_ThreshLo = thresh_lo;
    cd1.m_ThreshHi = thresh_hi;

    cd2.m_Src = src.Data();
    cd2.m_Des = des2.Data();
    cd2.m_NumPixels = num_pixels;
    cd2.m_NumClippedPixels = numeric_limits<uint64_t>::max();
    cd2.m_ThreshLo = thresh_lo;
    cd2.m_ThreshHi = thresh_hi;

    Avx2ClipPixelsCpp(&cd1);
    Avx2ClipPixels_(&cd2);

    cout << "\nResults for Avx2ClipPixels\n";
    cout << "  cd1.m_NumClippedPixels1: " << cd1.m_NumClippedPixels << '\n';
    cout << "  cd2.m_NumClippedPixels2: " << cd2.m_NumClippedPixels << '\n';

    if (cd1.m_NumClippedPixels != cd2.m_NumClippedPixels)
        cout << "  NumClippedPixels compare error\n";

    if (memcmp(des1.Data(), des2.Data(), num_pixels) == 0)
        cout << "  Pixel buffer memory compare passed\n";
    else
        cout << "  Pixel buffer memory compare failed\n";
}

int main(void)
{
    Avx2ClipPixels();
    Avx2ClipPixels_BM();
    return 0;
}
;-------------------------------------------------
;               Ch10_04.asm
;-------------------------------------------------

; The following structure must match the structure that's declared in the .h file

ClipData            struct
Src                 qword ?         ;source buffer pointer
Des                 qword ?         ;destination buffer pointer
NumPixels           qword ?         ;number of pixels
NumClippedPixels    qword ?         ;number of clipped pixels
ThreshLo            byte ?          ;low threshold
ThreshHi            byte ?          ;high threshold
ClipData            ends

; extern "C" bool Avx2ClipPixels_(ClipData* cd)

        .code
Avx2ClipPixels_ proc

; Load and validate arguments
        xor eax,eax                             ;set error return code
        xor r8d,r8d                             ;r8 = number of clipped pixels

        mov rdx,[rcx+ClipData.NumPixels]        ;rdx = num_pixels
        or rdx,rdx
        jz Done                                 ;jump if num_pixels is zero
        test rdx,1fh
        jnz Done                                ;jump if num_pixels % 32 != 0

        mov r10,[rcx+ClipData.Src]              ;r10 = Src
        test r10,1fh
        jnz Done                                ;jump if Src is misaligned

        mov r11,[rcx+ClipData.Des]              ;r11 = Des
        test r11,1fh
        jnz Done                                ;jump if Des is misaligned

; Create packed thresh_lo and thresh_hi data values
        vpbroadcastb ymm4,[rcx+ClipData.ThreshLo]   ;ymm4 = packed thresh_lo
        vpbroadcastb ymm5,[rcx+ClipData.ThreshHi]   ;ymm5 = packed thresh_hi

; Clip pixels to threshold values
@@:     vmovdqa ymm0,ymmword ptr [r10]          ;ymm0 = 32 pixels
        vpmaxub ymm1,ymm0,ymm4                  ;clip to thresh_lo
        vpminub ymm2,ymm1,ymm5                  ;clip to thresh_hi
        vmovdqa ymmword ptr [r11],ymm2          ;save clipped pixels

; Count number of clipped pixels
        vpcmpeqb ymm3,ymm2,ymm0                 ;compare clipped pixels to original
        vpmovmskb eax,ymm3                      ;eax = mask of non-clipped pixels
        not eax                                 ;eax = mask of clipped pixels
        popcnt eax,eax                          ;eax = number of clipped pixels
        add r8,rax                              ;update clipped pixel count

; Update pointers and loop counter
        add r10,32                              ;update Src ptr
        add r11,32                              ;update Des ptr
        sub rdx,32                              ;update loop counter
        jnz @B                                  ;repeat if not done

        mov eax,1                               ;set success return code

; Save num_clipped_pixels
Done:   mov [rcx+ClipData.NumClippedPixels],r8  ;save num_clipped_pixels
        vzeroupper
        ret
Avx2ClipPixels_ endp
        end


The C++ code begins with the declaration of a structure named ClipData. This structure and its assembly
language equivalent are used to maintain the pixel-clipping algorithm's data. Following the function
declarations in the header file Ch10_04.h is the definition of a C++ function named Init. This function
initializes the elements of a uint8_t array using random values, which simulates the pixel values of a
grayscale image. The function Avx2ClipPixelsCpp is a C++ implementation of the pixel clipping algorithm.
This function starts by validating num_pixels for correct size and divisibility by 32. Restricting the algorithm
to images that contain an integral multiple of 32 pixels is not as inflexible as it might appear. Most digital
camera images are sized using multiples of 64 pixels due to the processing requirements of the JPEG
compression algorithms. Following validation of num_pixels, the source and destination pixel buffers are
checked for proper alignment.
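
The two validation steps just described amount to simple bit tests. The following is a minimal sketch, not the book's AlignedMem::IsAligned implementation; the helper names IsAligned32 and IsValidPixelCount are hypothetical and were chosen only for illustration. It mirrors the `test reg,1fh` checks that the assembly language code performs.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the book's AlignedMem class):
// an address is 32-byte aligned when its low five bits are all zero.
inline bool IsAligned32(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 0x1f) == 0;
}

// Hypothetical helper: true if n is a nonzero multiple of 32.
inline bool IsValidPixelCount(std::size_t n)
{
    return (n != 0) && ((n & 0x1f) == 0);
}
```

Masking with 0x1f works because 32 is a power of two; a general divisor would need the % operator instead.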

The procedure used in Avx2ClipPixelsCpp to perform pixel clipping is straightforward. A simple
for loop examines each pixel element in the source image buffer. If a source image pixel intensity
value is found to be below thresh_lo or above thresh_hi, the corresponding threshold limit is saved in the
destination buffer. Source image pixels whose intensity values lie between the two threshold limits are
copied to the destination pixel buffer unaltered. The processing loop in Avx2ClipPixelsCpp also counts the
number of clipped pixels for comparison purposes with the assembly language version of the algorithm.

Function Avx2ClipPixels exploits the C++ template class AlignedArray to allocate and manage the
required image pixel buffers (see Chapter 7 for a description of this class). Following source image pixel
buffer initialization, Avx2ClipPixels primes two instances of ClipData (cd1 and cd2) for use by the pixel
clipping functions Avx2ClipPixelsCpp and Avx2ClipPixels_. It then invokes these functions and compares
the results for any discrepancies.

Toward the top of the assembly language code is the declaration for data structure ClipData, which
is semantically equivalent to its C++ counterpart. The function Avx2ClipPixels_ begins its execution
by validating num_pixels for size and divisibility by 32. It then checks the source and destination pixel
buffers for proper alignment. Following argument validation, Avx2ClipPixels_ employs two vpbroadcastb
instructions to create packed versions of the threshold limit values thresh_lo and thresh_hi in registers
YMM4 and YMM5, respectively. During each processing loop iteration, the vmovdqa ymm0,ymmword ptr
[r10] instruction loads 32 pixel values from the source image pixel buffer into register YMM0. The ensuing
vpmaxub ymm1,ymm0,ymm4 instruction clips the pixel values in YMM0 to thresh_lo. This is followed by a
vpminub ymm2,ymm1,ymm5 instruction that clips the pixel values to thresh_hi. The vmovdqa ymmword ptr
[r11],ymm2 instruction then saves the clipped pixel intensity values to the destination image pixel buffer.
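
The max-then-min pairing can be illustrated for a single pixel. This is a minimal scalar sketch of what the vpmaxub and vpminub instructions do to each of the 32 byte elements; the function name ClipPixel is hypothetical, and the sketch assumes lo <= hi.

```cpp
#include <algorithm>
#include <cstdint>

// Per-pixel model of the vpmaxub / vpminub pair (assumes lo <= hi).
inline uint8_t ClipPixel(uint8_t p, uint8_t lo, uint8_t hi)
{
    uint8_t t = std::max(p, lo);    // vpmaxub: values below lo are raised to lo
    return std::min(t, hi);         // vpminub: values above hi are lowered to hi
}
```

Because both steps are branch-free element-wise operations, the same two instructions handle all 32 pixels of a YMM register at once.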

Avx2ClipPixels_ counts the number of clipped pixels using a straightforward sequence of instructions.
The vpcmpeqb ymm3,ymm2,ymm0 instruction compares the original pixel values in YMM0 to the clipped
pixel values in YMM2 for equality. Each byte element in YMM3 is set to 0xff if the original and clipped
pixel intensity values are equal; otherwise, the YMM3 byte element is set to 0x00. The vpmovmskb eax,ymm3
instruction that follows creates a mask of the most significant bit of each byte element in YMM3 and saves
this mask to register EAX. More specifically, this instruction computes eax[i] = ymm3[i*8+7] for i = 0,
1, 2, ... 31, which means that each 1 bit in register EAX signifies a non-clipped pixel. The ensuing not eax
instruction converts the bit pattern in EAX to a mask of clipped pixels, and the popcnt eax,eax instruction
counts the number of 1 bits in EAX. This count value, which corresponds to the number of clipped pixels in
YMM2, is then added to the total number of clipped pixels in register R8. The processing loop repeats until
all pixels have been processed. Here are the results for source code example Ch10_04:

Results for Avx2ClipPixels
  cd1.m_NumClippedPixels1: 328090
  cd2.m_NumClippedPixels2: 328090
  Pixel buffer memory compare passed

Running benchmark function Avx2ClipPixels_BM - please wait
Benchmark times save to file Ch10_04_Avx2ClipPixels_BM_CHROMIUM.csv
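
The compare, mask, invert, and population-count sequence used to tally clipped pixels can be modeled in portable C++. The sketch below handles one 32-pixel block and is an illustration only; the function name CountClipped32 is hypothetical, and std::bitset stands in for the popcnt instruction.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Scalar model of vpcmpeqb + vpmovmskb + not + popcnt for one 32-pixel block.
inline unsigned CountClipped32(const uint8_t* src, const uint8_t* des)
{
    uint32_t mask = 0;
    for (std::size_t i = 0; i < 32; i++)
    {
        if (src[i] == des[i])               // vpcmpeqb: equal bytes become 0xff
            mask |= (uint32_t{1} << i);     // vpmovmskb: MSB of byte i goes to bit i
    }
    mask = ~mask;                           // not eax: 1 bits now mark clipped pixels
    return static_cast<unsigned>(std::bitset<32>(mask).count());   // popcnt eax,eax
}
```

The inversion step is needed because the equality compare flags the pixels that were left unchanged, which is the opposite of what the algorithm wants to count.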


Table 10-1 shows the benchmark timing measurements for the pixel clipping functions
Avx2ClipPixelsCpp and Avx2ClipPixels_.


Table 10-1. Mean Execution Times (Microseconds) for Pixel Clipping Functions (Image Buffer Size = 8 MB)

CPU         Avx2ClipPixelsCpp    Avx2ClipPixels_
i7-4790S    13005                1078
i9-7900X    11617                719
i7-8700K    11252                644


RGB Pixel Min-Max Values 


Listing 10-5 shows the C++ and assembly language source code for example Ch10_05, which illustrates how 
to calculate the minimum and maximum pixel intensity values in an RGB image. This example also explains 
how to exploit some of MASM’s advanced macro processing capabilities. 


Listing 10-5. Example Ch10_05 


#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <iomanip>
#include <random>
#include "AlignedMem.h"

using namespace std;

extern "C" bool Avx2CalcRgbMinMax_(uint8_t* rgb[3], size_t num_pixels, uint8_t min_vals[3],
    uint8_t max_vals[3]);

void Init(uint8_t* rgb[3], size_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {5, 250};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
    {
        rgb[0][i] = (uint8_t)ui_dist(rng);
        rgb[1][i] = (uint8_t)ui_dist(rng);
        rgb[2][i] = (uint8_t)ui_dist(rng);
    }

    // Set known min & max values for validation purposes
    rgb[0][n / 4] = 4;   rgb[1][n / 2] = 1;       rgb[2][3 * n / 4] = 3;
    rgb[0][n / 3] = 254; rgb[1][2 * n / 5] = 251; rgb[2][n - 1] = 252;
}

bool Avx2CalcRgbMinMaxCpp(uint8_t* rgb[3], size_t num_pixels, uint8_t min_vals[3],
    uint8_t max_vals[3])
{
    // Make sure num_pixels is valid
    if ((num_pixels == 0) || (num_pixels % 32 != 0))
        return false;

    if (!AlignedMem::IsAligned(rgb[0], 32))
        return false;
    if (!AlignedMem::IsAligned(rgb[1], 32))
        return false;
    if (!AlignedMem::IsAligned(rgb[2], 32))
        return false;

    // Find the min and max of each color plane
    min_vals[0] = min_vals[1] = min_vals[2] = 255;
    max_vals[0] = max_vals[1] = max_vals[2] = 0;

    for (size_t i = 0; i < 3; i++)
    {
        for (size_t j = 0; j < num_pixels; j++)
        {
            // Test min and max independently; with an else-if, a plane whose
            // first pixel is its unique maximum would never update max_vals
            if (rgb[i][j] < min_vals[i])
                min_vals[i] = rgb[i][j];
            if (rgb[i][j] > max_vals[i])
                max_vals[i] = rgb[i][j];
        }
    }

    return true;
}

int main(void)
{
    const size_t n = 1024;
    uint8_t* rgb[3];
    uint8_t min_vals1[3], max_vals1[3];
    uint8_t min_vals2[3], max_vals2[3];

    AlignedArray<uint8_t> r(n, 32);
    AlignedArray<uint8_t> g(n, 32);
    AlignedArray<uint8_t> b(n, 32);

    rgb[0] = r.Data();
    rgb[1] = g.Data();
    rgb[2] = b.Data();

    Init(rgb, n, 219);
    Avx2CalcRgbMinMaxCpp(rgb, n, min_vals1, max_vals1);
    Avx2CalcRgbMinMax_(rgb, n, min_vals2, max_vals2);

    cout << "Results for Avx2CalcRgbMinMax\n\n";
    cout << "              R    G    B\n";

    cout << "min_vals1: ";
    cout << setw(4) << (int)min_vals1[0] << ' ';
    cout << setw(4) << (int)min_vals1[1] << ' ';
    cout << setw(4) << (int)min_vals1[2] << '\n';
    cout << "min_vals2: ";
    cout << setw(4) << (int)min_vals2[0] << ' ';
    cout << setw(4) << (int)min_vals2[1] << ' ';
    cout << setw(4) << (int)min_vals2[2] << "\n\n";

    cout << "max_vals1: ";
    cout << setw(4) << (int)max_vals1[0] << ' ';
    cout << setw(4) << (int)max_vals1[1] << ' ';
    cout << setw(4) << (int)max_vals1[2] << '\n';
    cout << "max_vals2: ";
    cout << setw(4) << (int)max_vals2[0] << ' ';
    cout << setw(4) << (int)max_vals2[1] << ' ';
    cout << setw(4) << (int)max_vals2[2] << '\n';

    return 0;
}

;-------------------------------------------------
;               Ch10_05.asm
;-------------------------------------------------

        include <MacrosX86-64-AVX.asmh>

; 256-bit wide constants
ConstVals       segment readonly align(32) 'const'
InitialPminVal  db 32 dup(0ffh)
InitialPmaxVal  db 32 dup(00h)
ConstVals       ends

; Macro _YmmVpextrMinub
;
; This macro generates code that extracts the smallest unsigned byte from register YmmSrc.

_YmmVpextrMinub macro GprDes,YmmSrc,YmmTmp

; Make sure YmmSrc and YmmTmp are different
.erridni <YmmSrc>, <YmmTmp>, <Invalid registers>

; Construct text strings for the corresponding XMM registers
        YmmSrcSuffix SUBSTR <YmmSrc>,2
        XmmSrc CATSTR <X>,YmmSrcSuffix

        YmmTmpSuffix SUBSTR <YmmTmp>,2
        XmmTmp CATSTR <X>,YmmTmpSuffix

; Reduce the 32 byte values in YmmSrc to the smallest value
        vextracti128 XmmTmp,YmmSrc,1
        vpminub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 16 min values

        vpsrldq XmmTmp,XmmSrc,8
        vpminub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 8 min values

        vpsrldq XmmTmp,XmmSrc,4
        vpminub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 4 min values

        vpsrldq XmmTmp,XmmSrc,2
        vpminub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 2 min values

        vpsrldq XmmTmp,XmmSrc,1
        vpminub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 1 min value

        vpextrb GprDes,XmmSrc,0         ;mov final min value to Gpr
        endm


; Macro _YmmVpextrMaxub
;
; This macro generates code that extracts the largest unsigned byte from register YmmSrc.

_YmmVpextrMaxub macro GprDes,YmmSrc,YmmTmp

; Make sure YmmSrc and YmmTmp are different
.erridni <YmmSrc>, <YmmTmp>, <Invalid registers>

; Construct text strings for the corresponding XMM registers
        YmmSrcSuffix SUBSTR <YmmSrc>,2
        XmmSrc CATSTR <X>,YmmSrcSuffix

        YmmTmpSuffix SUBSTR <YmmTmp>,2
        XmmTmp CATSTR <X>,YmmTmpSuffix

; Reduce the 32 byte values in YmmSrc to the largest value
        vextracti128 XmmTmp,YmmSrc,1
        vpmaxub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 16 max values

        vpsrldq XmmTmp,XmmSrc,8
        vpmaxub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 8 max values

        vpsrldq XmmTmp,XmmSrc,4
        vpmaxub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 4 max values

        vpsrldq XmmTmp,XmmSrc,2
        vpmaxub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 2 max values

        vpsrldq XmmTmp,XmmSrc,1
        vpmaxub XmmSrc,XmmSrc,XmmTmp    ;XmmSrc = final 1 max value

        vpextrb GprDes,XmmSrc,0         ;mov final max value to Gpr
        endm

; extern "C" bool Avx2CalcRgbMinMax_(uint8_t* rgb[3], size_t num_pixels, uint8_t min_vals[3], uint8_t max_vals[3])

        .code
Avx2CalcRgbMinMax_ proc frame
        _CreateFrame CalcMinMax_,0,48,r12
        _SaveXmmRegs xmm6,xmm7,xmm8
        _EndProlog

; Make sure num_pixels and the color plane arrays are valid
        xor eax,eax                         ;set error code

        test rdx,rdx
        jz Done                             ;jump if num_pixels == 0
        test rdx,01fh
        jnz Done                            ;jump if num_pixels % 32 != 0

        mov r10,[rcx]                       ;r10 = color plane R
        test r10,1fh
        jnz Done                            ;jump if color plane R is not aligned

        mov r11,[rcx+8]                     ;r11 = color plane G
        test r11,1fh
        jnz Done                            ;jump if color plane G is not aligned

        mov r12,[rcx+16]                    ;r12 = color plane B
        test r12,1fh
        jnz Done                            ;jump if color plane B is not aligned

; Initialize the processing loop registers
        vmovdqa ymm3,ymmword ptr [InitialPminVal]   ;ymm3 = R minimums
        vmovdqa ymm4,ymm3                           ;ymm4 = G minimums
        vmovdqa ymm5,ymm3                           ;ymm5 = B minimums

        vmovdqa ymm6,ymmword ptr [InitialPmaxVal]   ;ymm6 = R maximums
        vmovdqa ymm7,ymm6                           ;ymm7 = G maximums
        vmovdqa ymm8,ymm6                           ;ymm8 = B maximums

        xor rcx,rcx                         ;rcx = common array offset

; Scan RGB color plane arrays for packed minimums and maximums
        align 16
@@:     vmovdqa ymm0,ymmword ptr [r10+rcx]  ;ymm0 = R pixels
        vmovdqa ymm1,ymmword ptr [r11+rcx]  ;ymm1 = G pixels
        vmovdqa ymm2,ymmword ptr [r12+rcx]  ;ymm2 = B pixels

        vpminub ymm3,ymm3,ymm0              ;update R minimums
        vpminub ymm4,ymm4,ymm1              ;update G minimums
        vpminub ymm5,ymm5,ymm2              ;update B minimums

        vpmaxub ymm6,ymm6,ymm0              ;update R maximums
        vpmaxub ymm7,ymm7,ymm1              ;update G maximums
        vpmaxub ymm8,ymm8,ymm2              ;update B maximums

        add rcx,32
        sub rdx,32
        jnz @B

; Calculate the final RGB minimum values
        _YmmVpextrMinub rax,ymm3,ymm0
        mov byte ptr [r8],al                ;save min R
        _YmmVpextrMinub rax,ymm4,ymm0
        mov byte ptr [r8+1],al              ;save min G
        _YmmVpextrMinub rax,ymm5,ymm0
        mov byte ptr [r8+2],al              ;save min B

; Calculate the final RGB maximum values
        _YmmVpextrMaxub rax,ymm6,ymm1
        mov byte ptr [r9],al                ;save max R
        _YmmVpextrMaxub rax,ymm7,ymm1
        mov byte ptr [r9+1],al              ;save max G
        _YmmVpextrMaxub rax,ymm8,ymm1
        mov byte ptr [r9+2],al              ;save max B

        mov eax,1                           ;set success return code

Done:   vzeroupper
        _RestoreXmmRegs xmm6,xmm7,xmm8
        _DeleteFrame r12
        ret
Avx2CalcRgbMinMax_ endp
        end

The function Avx2CalcRgbMinMaxCpp that's shown in Listing 10-5 is a C++ implementation of the
RGB min-max algorithm. This function employs a set of nested for loops to determine the minimum and
maximum pixel intensity values for each color plane. These values are maintained in the arrays min_vals
and max_vals. The function main uses the C++ template class AlignedArray to allocate three arrays that
simulate the color plane buffers of an RGB image. These buffers are loaded with random values by the
function Init. Note that function Init assigns known values to several elements in each color plane buffer.
These known values are used to verify correct execution of both the C++ and assembly language min-max
functions.

Toward the top of the assembly language code is a custom constant segment named ConstVals that
defines packed versions of the initial pixel minimum and maximum values. A custom segment is used here
to ensure alignment of the 256-bit wide packed values on a 32-byte boundary, as explained in Chapter 9. The
macro definitions _YmmVpextrMinub and _YmmVpextrMaxub are next. These macros contain instructions that
extract the smallest and largest byte values from a YMM register. The inner workings of these macros will be
explained shortly.

The function Avx2CalcRgbMinMax_ uses registers YMM3-YMM5 and YMM6-YMM8 to maintain the RGB
minimum and maximum values, respectively. During each iteration of the main processing loop, a series of
vpminub and vpmaxub instructions update the current RGB minimums and maximums. Upon completion
of the main processing loop, the aforementioned YMM registers contain 32 minimum and maximum pixel
intensity values for each color component. The _YmmVpextrMinub and _YmmVpextrMaxub macros are then
used to extract the final RGB minimum and maximum pixel values. These values are then saved to the result
arrays min_vals and max_vals, respectively.

The macro definitions _YmmVpextrMinub and _YmmVpextrMaxub are identical, except for the
instructions vpminub and vpmaxub. In the text that follows, all explanatory comments made about
_YmmVpextrMinub also apply to _YmmVpextrMaxub. The _YmmVpextrMinub macro requires three parameters:
a destination general-purpose register (GprDes), a source YMM register (YmmSrc), and a temporary YMM
register (YmmTmp). Note that macro parameters YmmSrc and YmmTmp must be different registers. If they're the
same, the .erridni directive (Error if Text Items are Identical, Case Insensitive) generates an error message
during assembly. MASM also supports several other conditional error directives besides .erridni, and these
are described in the Visual Studio documentation.

In order to generate the correct assembly language code, the macro _YmmVpextrMinub requires an
XMM register text string (XmmSrc) that corresponds to the low-order portion of the specified YmmSrc register.
For example, if YmmSrc equals YMM0, then XmmSrc must equal XMM0. The MASM directives substr
(Return Substring of Text Item) and catstr (Concatenate Text Items) are used to initialize XmmSrc. The
statement YmmSrcSuffix SUBSTR <YmmSrc>,2 assigns a text string value to YmmSrcSuffix that excludes the
leading character of macro parameter YmmSrc. For example, if YmmSrc equals YMM0, then YmmSrcSuffix
equals MM0. The next statement, XmmSrc CATSTR <X>,YmmSrcSuffix, adds a leading X to the value of
YmmSrcSuffix and assigns it to XmmSrc. Continuing with the earlier example, this means that the text string
XMM0 is assigned to XmmSrc. The SUBSTR and CATSTR directives are then used to assign a text string value to
XmmTmp.
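
For readers more comfortable with C++ string handling, the text manipulation performed by SUBSTR and CATSTR can be mimicked as shown below. This is only an analogy sketched for illustration; the function name YmmToXmmName is hypothetical, and MASM performs the equivalent work at assembly time rather than at run time.

```cpp
#include <string>

// Hypothetical helper mirroring the macro's text manipulation.
// MASM's SUBSTR positions are 1-based, so "SUBSTR <YmmSrc>,2" corresponds to substr(1).
inline std::string YmmToXmmName(const std::string& ymm_name)
{
    std::string suffix = ymm_name.substr(1);    // "YMM0" -> "MM0"
    return "X" + suffix;                        // "MM0"  -> "XMM0"
}
```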

Following initialization of the required macro text strings are the instructions that extract the smallest
byte value from the specified YMM register. The vextracti128 XmmTmp,YmmSrc,1 instruction copies the
high-order 16 bytes of register YmmSrc to XmmTmp. (The vextracti128 instruction also supports using an
immediate operand of 0 to copy the low-order 16 bytes.) A vpminub XmmSrc,XmmSrc,XmmTmp instruction
loads the final 16 minimum values into XmmSrc. The vpsrldq XmmTmp,XmmSrc,8 instruction shifts a copy of
the value that's in XmmSrc to the right by eight bytes and saves the result to XmmTmp. This facilitates the use of
another vpminub instruction that reduces the number of minimum byte values from 16 to 8. Repeated sets
of the vpsrldq and vpminub instructions are then employed until the final minimum value resides in the
low-order byte of XmmSrc. A vpextrb GprDes,XmmSrc,0 instruction copies the final minimum value to the
specified general-purpose register. Here are the results for source code example Ch10_05:


Results for Avx2CalcRgbMinMax 


R G B 
min_vals1: 4 1 3 
min_vals2: 4 1 3 


max_vals1: 254 251 252 
max_vals2: 254 251 252 
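
The halving reduction carried out by _YmmVpextrMinub can be modeled with a short scalar routine. The sketch below is an illustration under simplifying assumptions: the function name ReduceMin32 is hypothetical, and only the low-order elements that still matter at each step are updated, whereas the real vpminub instructions operate on all 16 bytes of the XMM register.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of the halving reduction; the working copy plays the role of XmmSrc.
inline uint8_t ReduceMin32(const std::array<uint8_t, 32>& vals)
{
    std::array<uint8_t, 32> v = vals;

    // width = 16 models vextracti128 + vpminub;
    // widths 8, 4, 2, and 1 model the vpsrldq + vpminub pairs.
    for (std::size_t width = 16; width >= 1; width /= 2)
    {
        for (std::size_t i = 0; i < width; i++)
            v[i] = std::min(v[i], v[i + width]);
    }

    return v[0];    // vpextrb GprDes,XmmSrc,0
}
```

Five compare steps are enough because each one halves the number of candidate values: 32, 16, 8, 4, 2, 1. Replacing std::min with std::max gives the corresponding model for _YmmVpextrMaxub.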


RGB to Grayscale Conversion 


The final source code example of this chapter, Ch10_06, explains how to perform an RGB to grayscale image
conversion. This example intermixes the packed integer capabilities of AVX2 that you have learned in this
chapter with the packed floating-point techniques presented in Chapter 9. Listing 10-6 shows the source
code for example Ch10_06.


Listing 10-6. Example Ch10_06


//------------------------------------------------
//               ImageMatrix.h
//------------------------------------------------

struct RGB32
{
    uint8_t m_R;
    uint8_t m_G;
    uint8_t m_B;
    uint8_t m_A;
};

//------------------------------------------------
//               Ch10_06.cpp
//------------------------------------------------

#include "stdafx.h"
#include <iostream>
#include <stdexcept>
#include "Ch10_06.h"
#include "AlignedMem.h"
#include "ImageMatrix.h"

using namespace std;

// Image size limits
extern "C" const int c_NumPixelsMin = 32;
extern "C" const int c_NumPixelsMax = 256 * 1024 * 1024;

// RGB to grayscale conversion coefficients
const float c_Coef[4] {0.2126f, 0.7152f, 0.0722f, 0.0f};

bool CompareGsImages(const uint8_t* pb_gs1, const uint8_t* pb_gs2, int num_pixels)
{
    for (int i = 0; i < num_pixels; i++)
    {
        if (abs((int)pb_gs1[i] - (int)pb_gs2[i]) > 1)
            return false;
    }

    return true;
}

bool Avx2ConvertRgbToGsCpp(uint8_t* pb_gs, const RGB32* pb_rgb, int num_pixels, const float
    coef[4])
{
    if (num_pixels < c_NumPixelsMin || num_pixels > c_NumPixelsMax)
        return false;
    if (num_pixels % 8 != 0)
        return false;
    if (!AlignedMem::IsAligned(pb_gs, 32))
        return false;
    if (!AlignedMem::IsAligned(pb_rgb, 32))
        return false;

    for (int i = 0; i < num_pixels; i++)
    {
        uint8_t r = pb_rgb[i].m_R;
        uint8_t g = pb_rgb[i].m_G;
        uint8_t b = pb_rgb[i].m_B;

        float gs_temp = r * coef[0] + g * coef[1] + b * coef[2] + 0.5f;

        if (gs_temp < 0.0f)
            gs_temp = 0.0f;
        else if (gs_temp > 255.0f)
            gs_temp = 255.0f;

        pb_gs[i] = (uint8_t)gs_temp;
    }

    return true;
}

void Avx2ConvertRgbToGs(void)
{
    const wchar_t* fn_rgb = L"..\\Ch10_Data\\TestImage3.bmp";
    const wchar_t* fn_gs1 = L"Ch10_06_Avx2ConvertRgbToGs_TestImage3_GS1.bmp";
    const wchar_t* fn_gs2 = L"Ch10_06_Avx2ConvertRgbToGs_TestImage3_GS2.bmp";

    ImageMatrix im_rgb(fn_rgb);
    int im_h = im_rgb.GetHeight();
    int im_w = im_rgb.GetWidth();
    int num_pixels = im_h * im_w;
    ImageMatrix im_gs1(im_h, im_w, PixelType::Gray8);
    ImageMatrix im_gs2(im_h, im_w, PixelType::Gray8);
    RGB32* pb_rgb = im_rgb.GetPixelBuffer<RGB32>();
    uint8_t* pb_gs1 = im_gs1.GetPixelBuffer<uint8_t>();
    uint8_t* pb_gs2 = im_gs2.GetPixelBuffer<uint8_t>();

    cout << "Results for Avx2ConvertRgbToGs\n";
    wcout << "Converting RGB image " << fn_rgb << '\n';
    cout << "  im_h = " << im_h << " pixels\n";
    cout << "  im_w = " << im_w << " pixels\n";

    // Exercise conversion functions
    bool rc1 = Avx2ConvertRgbToGsCpp(pb_gs1, pb_rgb, num_pixels, c_Coef);
    bool rc2 = Avx2ConvertRgbToGs_(pb_gs2, pb_rgb, num_pixels, c_Coef);

    if (rc1 && rc2)
    {
        wcout << "Saving grayscale image #1 - " << fn_gs1 << '\n';
        im_gs1.SaveToBitmapFile(fn_gs1);
        wcout << "Saving grayscale image #2 - " << fn_gs2 << '\n';
        im_gs2.SaveToBitmapFile(fn_gs2);

        if (CompareGsImages(pb_gs1, pb_gs2, num_pixels))
            cout << "Grayscale image compare OK\n";
        else
            cout << "Grayscale image compare failed\n";
    }
    else
        cout << "Invalid return code\n";
}

int main()
{
    try
    {
        Avx2ConvertRgbToGs();
        Avx2ConvertRgbToGs_BM();
    }
    catch (runtime_error& rte)
    {
        cout << "'runtime_error' exception has occurred - " << rte.what() << '\n';
    }
    catch (...)
    {
        cout << "Unexpected exception has occurred\n";
    }

    return 0;
}

;-------------------------------------------------
;               Ch10_06.asm
;-------------------------------------------------

        include <MacrosX86-64-AVX.asmh>

                .const
GsMask          dword 0ffffffffh, 0, 0, 0, 0ffffffffh, 0, 0, 0
r4_0p5          real4 0.5
r4_255p0        real4 255.0

                extern c_NumPixelsMin:dword
                extern c_NumPixelsMax:dword

; extern "C" bool Avx2ConvertRgbToGs_(uint8_t* pb_gs, const RGB32* pb_rgb, int num_pixels, const float coef[4])
;
; Note: Memory pointed to by pb_rgb is ordered as follows:
;       R(0,0), G(0,0), B(0,0), A(0,0), R(0,1), G(0,1), B(0,1), A(0,1), ...

        .code
Avx2ConvertRgbToGs_ proc frame
        _CreateFrame RGBGS_,0,112
        _SaveXmmRegs xmm6,xmm7,xmm11,xmm12,xmm13,xmm14,xmm15
        _EndProlog

; Validate argument values
        xor eax,eax                         ;set error return code
        cmp r8d,[c_NumPixelsMin]
        jl Done                             ;jump if num_pixels < min value
        cmp r8d,[c_NumPixelsMax]
        jg Done                             ;jump if num_pixels > max value
        test r8d,7
        jnz Done                            ;jump if (num_pixels % 8) != 0

        test rcx,1fh
        jnz Done                            ;jump if pb_gs is not aligned
        test rdx,1fh
        jnz Done                            ;jump if pb_rgb is not aligned

; Perform required initializations
        vbroadcastss ymm11,real4 ptr [r4_255p0]     ;ymm11 = packed 255.0
        vbroadcastss ymm12,real4 ptr [r4_0p5]       ;ymm12 = packed 0.5
        vpxor ymm13,ymm13,ymm13                     ;ymm13 = packed zero
        vmovups xmm0,xmmword ptr [r9]
        vperm2f128 ymm14,ymm0,ymm0,00000000b        ;ymm14 = packed coef
        vmovups ymm15,ymmword ptr [GsMask]          ;ymm15 = GsMask (SPFP)

; Load next 8 RGB32 pixel values (P0 - P7)
        align 16
@@:     vmovdqa ymm0,ymmword ptr [rdx]      ;ymm0 = 8 rgb32 pixels (P7 - P0)

; Size-promote RGB32 color components from bytes to dwords
; (the unpack instructions operate independently on each 128-bit lane)
        vpunpcklbw ymm1,ymm0,ymm13
        vpunpckhbw ymm2,ymm0,ymm13
        vpunpcklwd ymm3,ymm1,ymm13          ;ymm3 = P4, P0 (dword)
        vpunpckhwd ymm4,ymm1,ymm13          ;ymm4 = P5, P1 (dword)
        vpunpcklwd ymm5,ymm2,ymm13          ;ymm5 = P6, P2 (dword)
        vpunpckhwd ymm6,ymm2,ymm13          ;ymm6 = P7, P3 (dword)

; Convert color component values to single-precision floating-point
        vcvtdq2ps ymm0,ymm3                 ;ymm0 = P4, P0 (SPFP)
        vcvtdq2ps ymm1,ymm4                 ;ymm1 = P5, P1 (SPFP)
        vcvtdq2ps ymm2,ymm5                 ;ymm2 = P6, P2 (SPFP)
        vcvtdq2ps ymm3,ymm6                 ;ymm3 = P7, P3 (SPFP)

; Multiply color component values by color conversion coefficients
        vmulps ymm0,ymm0,ymm14
        vmulps ymm1,ymm1,ymm14
        vmulps ymm2,ymm2,ymm14
        vmulps ymm3,ymm3,ymm14

; Sum weighted color components for final grayscale values
        vhaddps ymm4,ymm0,ymm0
        vhaddps ymm4,ymm4,ymm4              ;ymm4[159:128] = P4, ymm4[31:0] = P0
        vhaddps ymm5,ymm1,ymm1
        vhaddps ymm5,ymm5,ymm5              ;ymm5[159:128] = P5, ymm5[31:0] = P1
        vhaddps ymm6,ymm2,ymm2
        vhaddps ymm6,ymm6,ymm6              ;ymm6[159:128] = P6, ymm6[31:0] = P2
        vhaddps ymm7,ymm3,ymm3
        vhaddps ymm7,ymm7,ymm7              ;ymm7[159:128] = P7, ymm7[31:0] = P3

; Merge SPFP grayscale values into a single YMM register
        vandps ymm4,ymm4,ymm15              ;mask out unneeded SPFP values
        vandps ymm5,ymm5,ymm15
        vandps ymm6,ymm6,ymm15
        vandps ymm7,ymm7,ymm15
        vpslldq ymm5,ymm5,4
        vpslldq ymm6,ymm6,8
        vpslldq ymm7,ymm7,12
        vorps ymm0,ymm4,ymm5                ;merge values
        vorps ymm1,ymm6,ymm7
        vorps ymm2,ymm0,ymm1                ;ymm2 = 8 GS pixel values (SPFP)

; Add 0.5 rounding factor and clip to 0.0 - 255.0
        vaddps ymm2,ymm2,ymm12              ;add 0.5f rounding factor
        vminps ymm3,ymm2,ymm11              ;clip pixels above 255.0
        vmaxps ymm4,ymm3,ymm13              ;clip pixels below 0.0

; Convert SPFP values to bytes and save
        vcvtps2dq ymm3,ymm4                 ;convert clipped GS SPFP to dwords
        vpackusdw ymm4,ymm3,ymm13           ;convert GS dwords to words
        vpackuswb ymm5,ymm4,ymm13           ;convert GS words to bytes
        vperm2i128 ymm6,ymm13,ymm5,3        ;xmm5 = GS P3:P0, xmm6 = GS P7:P4
        vmovd dword ptr [rcx],xmm5          ;save P3 - P0
        vmovd dword ptr [rcx+4],xmm6        ;save P7 - P4

        add rdx,32                          ;update pb_rgb to next block
        add rcx,8                           ;update pb_gs to next block
        sub r8d,8                           ;num_pixels -= 8
        jnz @B                              ;repeat until done

        mov eax,1                           ;set success return code

Done:   vzeroupper
        _RestoreXmmRegs xmm6,xmm7,xmm11,xmm12,xmm13,xmm14,xmm15
        _DeleteFrame
        ret
Avx2ConvertRgbToGs_ endp
        end


A variety of algorithms exist to convert an RGB image into a grayscale image. One frequently-used
technique calculates grayscale pixel values using a weighted sum of the RGB color components. In this
source code example, RGB pixels are converted to grayscale pixels using the following equation:


$$GS(x,y) = R(x,y)W_r + G(x,y)W_g + B(x,y)W_b$$


Each RGB color component weight (or coefficient) is a floating-point number between 0.0 and 1.0, 
and the sum of the three component coefficients normally equals 1.0. The exact values used for the color 
component coefficients are usually based on published standards that reflect a multitude of visual factors 
including properties of the target color space, display device characteristics, and perceived image quality. 
If you're interested in learning more about RGB to grayscale image conversion, Appendix A contains some 
references that you can consult. 
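
As a worked illustration of the weighted-sum equation, the sketch below converts a single pixel using the coefficient values that appear in the listing (0.2126, 0.7152, 0.0722). The helper name RgbToGs is hypothetical; the rounding factor and clipping follow the same steps as the C++ conversion function in this example.

```cpp
#include <cstdint>

// Hypothetical helper: weighted-sum grayscale value for a single pixel.
// coef holds the R, G, and B weights in that order.
inline uint8_t RgbToGs(uint8_t r, uint8_t g, uint8_t b, const float coef[3])
{
    float gs = r * coef[0] + g * coef[1] + b * coef[2] + 0.5f;  // weighted sum plus rounding factor

    if (gs < 0.0f)
        gs = 0.0f;
    else if (gs > 255.0f)
        gs = 255.0f;

    return static_cast<uint8_t>(gs);    // truncates toward zero
}
```

With these weights a pure green pixel (0, 255, 0) yields a much brighter gray (182) than a pure red (54) or pure blue (18) pixel, which reflects the larger weight assigned to the green component.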

Source code Ch10_06 opens with the structure declaration RGB32. This structure is declared in the
header file ImageMatrix.h and specifies the color component ordering scheme of each RGB pixel. The
function Avx2ConvertRgbToGsCpp contains a C++ implementation of the RGB to grayscale conversion
algorithm. This function uses an ordinary for loop that sweeps through the RGB32 image buffer pb_rgb and
computes grayscale pixel values using the aforementioned conversion equation. Note that RGB32 element
m_A is not used in any of the calculations in this example. Each calculated grayscale pixel value is adjusted by
a rounding factor and clipped to [0.0, 255.0] before it is saved to the grayscale image buffer pointed to by
pb_gs.

The assembly language code begins with a .const section that defines the necessary constants.
Following its prolog, the function Avx2ConvertRgbToGs_ performs the customary image size and buffer
alignment checks. It then loads the algorithm's required packed constants into registers YMM11-YMM15.
Note that register YMM14 contains a packed version of the color conversion coefficients, as illustrated in
Figure 10-3. The assembly language processing loop begins with a vmovdqa ymm0,ymmword ptr [rdx]
instruction that loads eight RGB32 pixel values into register YMM0. The color components of these pixels are
then size-promoted to doublewords using a series of vpunpck[l|h]bw and vpunpck[l|h]wd instructions. The
ensuing vcvtdq2ps instructions convert the pixel color components from doublewords to single-precision
floating-point values. Following execution of the four vcvtdq2ps instructions, registers YMM0-YMM3 each
contain two RGB32 pixels and each color component is a single-precision floating-point value. Figure 10-3
also shows the RGB32 size promotions and conversions discussed in this paragraph.


[Figure 10-3 panels: packed color coefficients (A B G R | A B G R); packed RGB32 pixels; pixel color
components following conversion to single-precision floating-point; shaded elements are don't care values]


Figure 10-3. RGB32 pixel color component size promotions and conversions 
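
The AVX2 unpack instructions operate independently on each 128-bit lane of a YMM register, which determines which two pixels share a register after size promotion. The following scalar model is a sketch for illustration only; all of the names in it (Bytes32, UnpackBytesZero, UnpackWordsZero, MakeTaggedPixels, PixelInLane) are hypothetical, and the second source operand is assumed to be all zeros, as it is in this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes32 = std::array<uint8_t, 32>;    // one YMM register viewed as 32 bytes

// Models vpunpcklbw / vpunpckhbw with an all-zero second operand:
// in each 128-bit lane, eight source bytes become eight zero-extended words.
inline Bytes32 UnpackBytesZero(const Bytes32& src, bool high)
{
    Bytes32 des{};                                  // zero-initialized
    for (std::size_t lane = 0; lane < 2; lane++)
    {
        const std::size_t base = lane * 16;
        const std::size_t first = base + (high ? 8 : 0);
        for (std::size_t i = 0; i < 8; i++)
            des[base + 2 * i] = src[first + i];     // upper byte of each word stays zero
    }
    return des;
}

// Models vpunpcklwd / vpunpckhwd with an all-zero second operand:
// in each 128-bit lane, four source words become four zero-extended doublewords.
inline Bytes32 UnpackWordsZero(const Bytes32& src, bool high)
{
    Bytes32 des{};
    for (std::size_t lane = 0; lane < 2; lane++)
    {
        const std::size_t base = lane * 16;
        const std::size_t first = base + (high ? 8 : 0);
        for (std::size_t i = 0; i < 4; i++)
        {
            des[base + 4 * i] = src[first + 2 * i];         // low byte of the word
            des[base + 4 * i + 1] = src[first + 2 * i + 1]; // high byte of the word
        }
    }
    return des;
}

// Test pattern: each byte of pixel k is tagged as 10 * k + component index.
inline Bytes32 MakeTaggedPixels()
{
    Bytes32 px{};
    for (std::size_t k = 0; k < 8; k++)
        for (std::size_t c = 0; c < 4; c++)
            px[4 * k + c] = static_cast<uint8_t>(10 * k + c);
    return px;
}

// Returns the index of the pixel that occupies the given lane of a promoted register.
inline int PixelInLane(const Bytes32& reg, std::size_t lane)
{
    return reg[lane * 16] / 10;
}
```

Applying the byte unpack followed by the word unpack to the tagged pattern shows that the low/low combination holds pixels P0 and P4, one per lane, rather than two adjacent pixels.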


The four vmulps instructions multiply the eight RGB32 pixels by the color conversion coefficients. The
ensuing vhaddps instructions sum the weighted color components of each pixel to generate the required
grayscale values. Following execution of these instructions, registers YMM4-YMM7 each contain two
single-precision floating-point grayscale pixel values, one in element position [31:0] and the other in [159:128], as
shown in Figure 10-4. The eight grayscale values in YMM4-YMM7 are then merged into YMM2 using a series
of vandps, vpslldq, and vorps instructions. Figure 10-4 also shows the final merged result. The vaddps,
vminps, and vmaxps instructions that follow add in the rounding factor (0.5) and clip the grayscale pixels to
[0.0, 255.0]. These values are then converted to unsigned bytes using the instructions vcvtps2dq, vpackusdw,
and vpackuswb. The two vmovd instructions save the four unsigned byte pixel values in both XMM5[31:0] and
XMM6[31:0] to the grayscale image buffer.


[Figure 10-4 panels: grayscale pixels as single-precision floating-point values; merged grayscale pixels
as single-precision floating-point values; shaded elements are don't care values]


Figure 10-4. Grayscale single-precision floating-point pixel values before and after merging 
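
One plausible reason that CompareGsImages tolerates a difference of 1 between the two grayscale images is that the C++ code truncates when it casts to uint8_t, while vcvtps2dq rounds according to the MXCSR rounding mode. The sketch below illustrates that difference; it assumes the default round-to-nearest mode is in effect, and the helper names ToByteTruncate and ToByteRoundNearest are hypothetical.

```cpp
#include <cmath>
#include <cstdint>

// Adds the 0.5 rounding factor and clips to [0.0, 255.0], as both versions do.
inline float AddHalfAndClip(float gs)
{
    float t = gs + 0.5f;
    if (t < 0.0f)
        t = 0.0f;
    else if (t > 255.0f)
        t = 255.0f;
    return t;
}

// Models the C++ path: the cast to uint8_t truncates toward zero.
inline uint8_t ToByteTruncate(float gs)
{
    return static_cast<uint8_t>(AddHalfAndClip(gs));
}

// Models the assembly language path under the assumption that the default
// round-to-nearest mode is active (std::nearbyint honors the current mode).
inline uint8_t ToByteRoundNearest(float gs)
{
    return static_cast<uint8_t>(std::nearbyint(AddHalfAndClip(gs)));
}
```

For an unrounded grayscale value of 100.2, the first function produces 100 and the second produces 101, so the two images can legitimately differ by one intensity level at some pixels.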


Here are the results of source code example Ch10_06:


Results for Avx2ConvertRgbToGs
Converting RGB image ..\Ch10_Data\TestImage3.bmp
  im_h = 960 pixels
  im_w = 640 pixels
Saving grayscale image #1 - Ch10_06_Avx2ConvertRgbToGs_TestImage3_GS1.bmp
Saving grayscale image #2 - Ch10_06_Avx2ConvertRgbToGs_TestImage3_GS2.bmp
Grayscale image compare OK

Running benchmark function Avx2ConvertRgbToGs_BM - please wait
Benchmark times save to file Ch10_06_Avx2ConvertRgbToGs_BM_CHROMIUM.csv


Table 10-2 shows the benchmark timing measurements for the RGB to grayscale image conversion
functions Avx2ConvertRgbToGsCpp and Avx2ConvertRgbToGs_. The performance gains of this source code
example are modest compared to some of the other examples in this book. The reason for this is that the
RGB32 color components in the source image buffer are interleaved with each other, which necessitates the
use of slower horizontal arithmetic. Rearranging the RGB32 data so that the pixels of each color component
reside in separate image buffers often results in significantly faster performance. You'll see an example of this
in Chapter 14.

Table 10-2. Mean Execution Times (Microseconds) for RGB to Grayscale Image Conversion Using
TestImage3.bmp

CPU         Avx2ConvertRgbToGsCpp    Avx2ConvertRgbToGs_
i7-4790S    1504                     843
i9-7900X    1075                     593
i7-8700K    1031                     565


Summary 


Here are the key learning points of Chapter 10: 

AVX2 extends the packed integer capabilities of AVX. Most x86-AVX packed integer 
instructions can be used with either 128-bit or 256-bit wide operands. These 
operands should be properly aligned whenever possible. 


Similar to x86-AVX floating-point, assembly language functions that perform packed 
integer calculations using a YMM register should use a vzeroupper instruction prior 
to any epilog code or the ret instruction. This avoids potential performance delays that 
can occur when the processor transitions from executing x86-AVX instructions to 
x86-SSE instructions. 


The Visual C++ calling convention differs for assembly language functions that 
return a structure by value. A function that returns a structure by value must copy 
a large structure (one greater than eight bytes) to the buffer pointed to by the RCX 
register. The normal calling convention registers are also "right-shifted" as explained 
in this chapter. 


Assembly language functions can use the vpunpckl[bw|wd|dq] and 
vpunpckh[bw|wd|dq] instructions to unpack 128-bit or 256-bit wide integer 
operands. 


Assembly language functions can use the vpackss[dw|wb] and vpackus[dw|wb] 
instructions to pack 128-bit or 256-bit wide integer operands using signed or 
unsigned saturation. 


Assembly language functions can use the vpmovzx[bw|bd|bq|wd|wq|dq] and 
vpmovsx[bw|bd|bq|wd|wq|dq] instructions to perform zero or sign extended packed 
integer size promotions. 


MASM supports directives that can perform rudimentary string processing 
operations, which can be employed to construct text strings for macro instruction 
mnemonics, operands, and labels. MASM also supports conditional error directives 
that can be used to signal error conditions during source code assembly. 
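The saturation behavior mentioned in the packing bullet above can be modeled in scalar C++. The following is only a sketch of what a single element conversion does when 16-bit words are packed to bytes; the function names are hypothetical and are not part of the book's source code.

```cpp
#include <algorithm>
#include <cstdint>

// Scalar model of one element of a word-to-byte pack with unsigned
// saturation (the per-element behavior of vpackuswb): the signed 16-bit
// input is clamped to [0, 255].
// NOTE: PackUnsignedSaturate is a hypothetical name used for illustration.
uint8_t PackUnsignedSaturate(int16_t v)
{
    int t = std::min(std::max(static_cast<int>(v), 0), 255);
    return static_cast<uint8_t>(t);
}

// Scalar model of one element of a word-to-byte pack with signed
// saturation (the per-element behavior of vpacksswb): the signed 16-bit
// input is clamped to [-128, 127].
// NOTE: PackSignedSaturate is a hypothetical name used for illustration.
int8_t PackSignedSaturate(int16_t v)
{
    int t = std::min(std::max(static_cast<int>(v), -128), 127);
    return static_cast<int8_t>(t);
}
```

Values that already fit in the destination type pass through unchanged; only out-of-range values are clamped, which is why saturated packing is a natural fit for pixel data.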


CHAPTER 11 


AVX2 Programming - Extended 
Instructions 


In this chapter, you learn how to use some of the instruction set extensions that were introduced in 
Chapter 8. The first section contains a couple of source code examples that exemplify use of the scalar and 
packed fused-multiply-add (FMA) instructions. The second section covers instructions that involve the 
general-purpose registers. This section includes source code examples that explain flagless multiplication 
and bit shifting. It also surveys some of the enhanced bit-manipulation instructions. The final section 
discusses the instructions that perform half-precision floating-point conversions. 

The source code examples in the first two sections of this chapter execute correctly on most processors 
from AMD and Intel that support AVX2. The half-precision floating-point source code example works 
on AMD and Intel processors that support AVX and the F16C instruction set extension. As a reminder, you 
should never assume that a specific instruction set extension is available based on whether the processor 
supports AVX or AVX2. Production code should always test for a specific instruction set extension using the 
cpuid instruction. You learn how to do this in Chapter 16. 


FMA Programming 


An FMA calculation performs a floating-point multiplication followed by a floating-point addition using a 
single rounding operation. Chapter 8 introduced FMA operations and discussed the particulars in greater 
detail. In this section, you learn how to use FMA instructions to implement discrete convolution functions. 
The section begins with a brief overview of convolution mathematics. The purpose of this overview is to 
explain just enough theory to understand the source code examples. This is followed by a source code 
example that implements a practical discrete convolution function using scalar FMA instructions. The 
section concludes with a source code example that exploits packed FMA instructions to accelerate the 
performance of a function that performs discrete convolutions. 


Convolutions 


Convolution is a mathematical operation that blends an input signal with a response signal to produce an 
output signal. Formally, the convolution of input signal f and response signal g is defined as follows: 


$$h(t) = \int_{-\infty}^{\infty} f(t - \tau)\, g(\tau)\, d\tau$$


where h represents the output signal. The notation f * g is commonly used to denote the convolution of 
signals (or functions) f and g. 

Convolutions are used extensively in a wide variety of scientific and engineering applications. Many 
signal processing and image processing techniques are based on convolution theory. In these domains, 
discrete arrays of sampled data points represent the input, response, and output signals. A discrete 
convolution can be calculated using the following equation: 


$$h[i] = \sum_{k=-M}^{M} f[i - k]\, g[k]$$


where $i = 0, 1, \ldots, N - 1$ and $M = \lfloor N_g / 2 \rfloor$. In the preceding equations, $N$ denotes the number of elements in 
both the input and output signal arrays and $N_g$ symbolizes the size of the response signal array. All of the 
explanations and source code examples in this section assume that $N_g$ is an odd integer greater than or 
equal to three. If you examine the discrete convolution equation carefully, you will notice that each element 
in output signal array h is computed using a relatively uncomplicated sum-of-products calculation that 
encompasses elements of input signal array f and response signal array g. These types of calculations are 
easy to implement using FMA instructions. 

In digital signal processing, many applications use smoothing operators to reduce the amount of noise 
that's present in a raw signal. For example, the top plot in Figure 11-1 shows a raw data signal that contains 
a fair amount of noise. The bottom plot in Figure 11-1 shows the same signal following the application of a 
smoothing operator. In this instance, the smoothing operator convolved the original raw signal with a set of 
discrete coefficients that approximate a Gaussian (or low-pass) filter. These coefficients correspond to the 
response signal array g that's incorporated in the discrete convolution equation. The response signal array is 
often called a convolution kernel or convolution mask. 

Figure 11-1. Raw data signal (top plot) and its smoothed counterpart (bottom plot) 


The discrete convolution equation can be implemented in source code using a couple of nested for 
loops. During each outer loop iteration, the convolution kernel center point g[0] is superimposed over the 
current input signal array element f [i]. The inner loop calculates the intermediate products, as shown 
in Figure 11-2. These intermediate products are then summed and saved to output signal array element 
h[i], which is also shown in Figure 11-2. The FMA source code examples that are presented in this section 
implement convolution functions using the techniques described in this paragraph. 

[Figure 11-2 panels: input signal elements f[i-3] through f[i+3]; convolution kernel g[2], g[1], g[0], g[-1], g[-2]; intermediate products; output signal element h[i]]

Figure 11-2. Application of a smoothing operator with an input signal element 
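The nested for loop technique described above can be sketched in a few lines of C++. This is a minimal illustration rather than the book's implementation: the function name ConvolveNaive and the std::vector interface are assumptions, and input elements that fall outside the array are simply treated as zero here, whereas the listings in this chapter pad the input signal array instead.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of the book's source code). Computes
// h[i] = sum over k in [-m, m] of f[i - k] * g[k], where m = g.size() / 2.
// The kernel center point g[0] of the equation is stored at index m of g.
// ASSUMPTION: input elements outside of f are treated as 0.0.
std::vector<float> ConvolveNaive(const std::vector<float>& f, const std::vector<float>& g)
{
    assert(g.size() % 2 == 1);              // kernel size must be odd

    const int n = static_cast<int>(f.size());
    const int m = static_cast<int>(g.size()) / 2;
    std::vector<float> h(f.size(), 0.0f);

    for (int i = 0; i < n; i++)             // outer loop: one output element per pass
    {
        float sum = 0.0f;

        for (int k = -m; k <= m; k++)       // inner loop: sum of intermediate products
        {
            const int j = i - k;

            if (j >= 0 && j < n)
                sum += f[j] * g[k + m];
        }

        h[i] = sum;
    }

    return h;
}
```

For the input {1, 2, 3, 4, 5} and the three-element kernel {0.25, 0.5, 0.25}, this sketch yields {1, 2, 3, 4, 3.5}. The interior values are unchanged because the input is linear and the kernel is symmetric; the first and last values are lower than a padded version would produce because of the zero assumption at the edges.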


The purpose of the preceding overview was to provide just enough background math to understand 
the source code examples. Numerous tomes have been published that explain convolution and signal 
processing theory in significantly greater detail. Appendix A contains a list of introductory references that 
you can consult for additional information about convolution and signal processing theory. 


Scalar FMA 


Source code example Ch11_01 explains how to implement a one-dimensional discrete convolution function 
using scalar FMA instructions. It also elucidates the performance benefits of convolution functions that 
use fixed-size versus variable-sized convolution kernels. Listing 11-1 shows the source code for example 
Ch11_01. 


Listing 11-1. Example Ch11_01 


//------------------------------------------------
//               Ch11_01.h
//------------------------------------------------

#pragma once

// Ch11_01_Misc.cpp
extern void CreateSignal(float* x, int n, int kernel_size, unsigned int seed);
extern void PadSignal(float* x2, int n2, const float* x1, int n1, int ks2);
extern unsigned int g_RngSeedVal;

// Ch11_01.cpp
extern bool Convolve1Cpp(float* y, const float* x, int num_pts, const float* kernel, int kernel_size);
extern bool Convolve1Ks5Cpp(float* y, const float* x, int num_pts, const float* kernel, int kernel_size);

// Ch11_01_.asm
extern "C" bool Convolve1_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size);
extern "C" bool Convolve1Ks5_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size);

// Ch11_01_BM.cpp
extern void Convolve1_BM(void);


//------------------------------------------------
//               Ch11_01_Misc.cpp
//------------------------------------------------

#include "stdafx.h"
#include <iostream>
#include <random>
#include <stdexcept>
#define _USE_MATH_DEFINES
#include <math.h>
#include "Ch11_01.h"

using namespace std;

void CreateSignal(float* x, int n, int kernel_size, unsigned int seed)
{
    const float degtorad = (float)(M_PI / 180.0);
    const float t_start = 0;
    const float t_step = 0.002f;
    const int m = 3;
    const float amp[m] {1.0f, 0.80f, 1.20f};
    const float freq[m] {5.0f, 10.0f, 15.0f};
    const float phase[m] {0.0f, 45.0f, 90.0f};
    const int ks2 = kernel_size / 2;

    uniform_int_distribution<> ui_dist {0, 500};
    default_random_engine rng {seed};
    float t = t_start;

    for (int i = 0; i < n; i++, t += t_step)
    {
        float x_val = 0;

        // Sum three sinusoids, each perturbed by a small amount of noise
        for (int j = 0; j < m; j++)
        {
            float omega = 2.0f * (float)M_PI * freq[j];
            float x_temp1 = amp[j] * sin(omega * t + phase[j] * degtorad);
            int rand_val = ui_dist(rng);
            float noise = (float)((rand_val) - 250) / 10.0f;
            float x_temp2 = x_temp1 + x_temp1 * noise / 100.0f;
            x_val += x_temp2;
        }

        x[i] = x_val;
    }
}

extern void PadSignal(float* x2, int n2, const float* x1, int n1, int ks2)
{
    if (n2 != n1 + ks2 * 2)
        throw runtime_error("InitPad - invalid size argument");

    // Copy the original signal into the middle of the padded buffer
    for (int i = 0; i < n1; i++)
        x2[i + ks2] = x1[i];

    // Reflect the edge elements into the pad regions
    for (int i = 0; i < ks2; i++)
    {
        x2[i] = x1[ks2 - i - 1];
        x2[n1 + ks2 + i] = x1[n1 - i - 1];
    }
}
//------------------------------------------------
//               Ch11_01.cpp
//------------------------------------------------

#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <memory>
#include <fstream>
#include <stdexcept>
#include "Ch11_01.h"

using namespace std;

extern "C" const int c_NumPtsMin = 32;
extern "C" const int c_NumPtsMax = 16 * 1024 * 1024;
extern "C" const int c_KernelSizeMin = 3;
extern "C" const int c_KernelSizeMax = 15;
unsigned int g_RngSeedVal = 97;

void Convolve1(void)
{
    const int n1 = 512;
    const float kernel[] { 0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f };
    const int ks = sizeof(kernel) / sizeof(float);
    const int ks2 = ks / 2;
    const int n2 = n1 + ks2 * 2;

    // Create signal array
    unique_ptr<float[]> x1_up {new float[n1]};
    unique_ptr<float[]> x2_up {new float[n2]};
    float* x1 = x1_up.get();
    float* x2 = x2_up.get();

    CreateSignal(x1, n1, ks, g_RngSeedVal);
    PadSignal(x2, n2, x1, n1, ks2);

    // Perform convolutions
    const int num_pts = n1;
    unique_ptr<float[]> y1_up {new float[num_pts]};
    unique_ptr<float[]> y2_up {new float[num_pts]};
    unique_ptr<float[]> y3_up {new float[num_pts]};
    unique_ptr<float[]> y4_up {new float[num_pts]};
    float* y1 = y1_up.get();
    float* y2 = y2_up.get();
    float* y3 = y3_up.get();
    float* y4 = y4_up.get();

    bool rc1 = Convolve1Cpp(y1, x2, num_pts, kernel, ks);
    bool rc2 = Convolve1_(y2, x2, num_pts, kernel, ks);
    bool rc3 = Convolve1Ks5Cpp(y3, x2, num_pts, kernel, ks);
    bool rc4 = Convolve1Ks5_(y4, x2, num_pts, kernel, ks);

    cout << "Results for Convolve1\n";
    cout << "  rc1 = " << boolalpha << rc1 << '\n';
    cout << "  rc2 = " << boolalpha << rc2 << '\n';
    cout << "  rc3 = " << boolalpha << rc3 << '\n';
    cout << "  rc4 = " << boolalpha << rc4 << '\n';

    if (!rc1 || !rc2 || !rc3 || !rc4)
        return;

    // Save data
    const char* fn = "Ch11_01_Convolve1Results.csv";
    ofstream ofs(fn);

    if (ofs.bad())
        cout << "File create error - " << fn << '\n';
    else
    {
        const char* delim = ", ";

        ofs << fixed << setprecision(7);
        ofs << "i, x1, y1, y2, y3, y4\n";

        for (int i = 0; i < num_pts; i++)
        {
            ofs << setw(5) << i << delim;
            ofs << setw(10) << x1[i] << delim;
            ofs << setw(10) << y1[i] << delim;
            ofs << setw(10) << y2[i] << delim;
            ofs << setw(10) << y3[i] << delim;
            ofs << setw(10) << y4[i] << '\n';
        }

        ofs.close();
        cout << "\nConvolution results saved to file " << fn << '\n';
    }
}

bool Convolve1Cpp(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)
{
    int ks2 = kernel_size / 2;

    if ((kernel_size & 1) == 0)
        return false;

    if (kernel_size < c_KernelSizeMin || kernel_size > c_KernelSizeMax)
        return false;

    if (num_pts < c_NumPtsMin || num_pts > c_NumPtsMax)
        return false;

    x += ks2;   // x points to first signal point

    for (int i = 0; i < num_pts; i++)
    {
        float sum = 0;

        for (int k = -ks2; k <= ks2; k++)
        {
            float x_val = x[i - k];
            float kernel_val = kernel[k + ks2];

            sum += kernel_val * x_val;
        }

        y[i] = sum;
    }

    return true;
}

bool Convolve1Ks5Cpp(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)
{
    int ks2 = kernel_size / 2;

    if (kernel_size != 5)
        return false;

    if (num_pts < c_NumPtsMin || num_pts > c_NumPtsMax)
        return false;

    x += ks2;   // x points to first signal point

    for (int i = 0; i < num_pts; i++)
    {
        float sum = 0;
        int j = i + ks2;

        sum += x[j] * kernel[0];
        sum += x[j - 1] * kernel[1];
        sum += x[j - 2] * kernel[2];
        sum += x[j - 3] * kernel[3];
        sum += x[j - 4] * kernel[4];

        y[i] = sum;
    }

    return true;
}

int main()
{
    int ret_val = 1;

    try
    {
        Convolve1();
        Convolve1_BM();
        ret_val = 0;
    }

    catch (runtime_error& rte)
    {
        cout << "run_time exception has occurred\n";
        cout << rte.what() << '\n';
    }

    catch (...)
    {
        cout << "Unexpected exception has occurred\n";
    }

    return ret_val;
}

;-------------------------------------------------
;               Ch11_01_.asm
;-------------------------------------------------

        include <MacrosX86-64-AVX.asmh>
        extern c_NumPtsMin:dword
        extern c_NumPtsMax:dword
        extern c_KernelSizeMin:dword
        extern c_KernelSizeMax:dword

; extern "C" bool Convolve1_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)

        .code
Convolve1_ proc frame
        _CreateFrame CV_,0,0,rbx,rsi
        _EndProlog

; Verify argument values
        xor eax,eax                         ;set error code (rax is also loop index var)

        mov r10d,dword ptr [rbp+CV_OffsetStackArgs]
        test r10d,1
        jz Done                             ;jump if kernel_size is even
        cmp r10d,[c_KernelSizeMin]
        jl Done                             ;jump if kernel_size too small
        cmp r10d,[c_KernelSizeMax]
        jg Done                             ;jump if kernel_size too big

        cmp r8d,[c_NumPtsMin]
        jl Done                             ;jump if num_pts too small
        cmp r8d,[c_NumPtsMax]
        jg Done                             ;jump if num_pts too big

; Perform required initializations
        mov r8d,r8d                         ;r8 = num_pts
        shr r10d,1                          ;ks2 = ks / 2
        lea rdx,[rdx+r10*4]                 ;rdx = x + ks2 (first data point)

; Perform convolution
LP1:    vxorps xmm5,xmm5,xmm5               ;sum = 0.0;
        mov r11,r10
        neg r11                             ;k = -ks2

LP2:    mov rbx,rax
        sub rbx,r11                         ;rbx = i - k
        vmovss xmm0,real4 ptr [rdx+rbx*4]   ;xmm0 = x[i - k]

        mov rsi,r11
        add rsi,r10                         ;rsi = k + ks2
        vfmadd231ss xmm5,xmm0,[r9+rsi*4]    ;sum += x[i - k] * kernel[k + ks2]

        add r11,1                           ;k++
        cmp r11,r10
        jle LP2                             ;jump if k <= ks2

        vmovss real4 ptr [rcx+rax*4],xmm5   ;y[i] = sum

        add rax,1                           ;i += 1
        cmp rax,r8
        jl LP1                              ;jump if i < num_pts

        mov eax,1                           ;set success return code

Done:   vzeroupper
        _DeleteFrame rbx,rsi
        ret
Convolve1_ endp

; extern "C" bool Convolve1Ks5_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)

Convolve1Ks5_ proc

; Verify argument values
        xor eax,eax                         ;set error code (rax is also loop index var)

        cmp dword ptr [rsp+40],5
        jne Done                            ;jump if kernel_size is not 5

        cmp r8d,[c_NumPtsMin]
        jl Done                             ;jump if num_pts too small
        cmp r8d,[c_NumPtsMax]
        jg Done                             ;jump if num_pts too big

; Perform required initializations
        mov r8d,r8d                         ;r8 = num_pts
        add rdx,8                           ;x += 2

; Perform convolution
@@:     vxorps xmm4,xmm4,xmm4               ;initialize sum vars
        vxorps xmm5,xmm5,xmm5
        mov r11,rax
        add r11,2                           ;j = i + ks2

        vmovss xmm0,real4 ptr [rdx+r11*4]       ;xmm0 = x[j]
        vfmadd231ss xmm4,xmm0,[r9]              ;xmm4 += x[j] * kernel[0]

        vmovss xmm1,real4 ptr [rdx+r11*4-4]     ;xmm1 = x[j - 1]
        vfmadd231ss xmm5,xmm1,[r9+4]            ;xmm5 += x[j - 1] * kernel[1]

        vmovss xmm0,real4 ptr [rdx+r11*4-8]     ;xmm0 = x[j - 2]
        vfmadd231ss xmm4,xmm0,[r9+8]            ;xmm4 += x[j - 2] * kernel[2]

        vmovss xmm1,real4 ptr [rdx+r11*4-12]    ;xmm1 = x[j - 3]
        vfmadd231ss xmm5,xmm1,[r9+12]           ;xmm5 += x[j - 3] * kernel[3]

        vmovss xmm0,real4 ptr [rdx+r11*4-16]    ;xmm0 = x[j - 4]
        vfmadd231ss xmm4,xmm0,[r9+16]           ;xmm4 += x[j - 4] * kernel[4]

        vaddps xmm4,xmm4,xmm5
        vmovss real4 ptr [rcx+rax*4],xmm4       ;save y[i]

        inc rax                             ;i += 1
        cmp rax,r8
        jl @B                               ;jump if i < num_pts

        mov eax,1                           ;set success return code

Done:   vzeroupper
        ret
Convolve1Ks5_ endp
        end


The C++ code in Listing 11-1 begins with the header file Ch11_01.h, which contains the requisite 
function declarations for this example. The source code for function CreateSignal is next. This function 
constructs a synthetic input signal for test purposes. The synthetic input signal consists of three separate 
sinusoidal waveforms that are summed. Each waveform includes a small amount of random noise. The 
input signal generated by CreateSignal is the same signal that's shown in the top plot of Figure 11-1. 

When performing convolutions, it is often necessary to pad the input signal array with extra elements to 
avoid invalid memory accesses when the center point of the convolution kernel is superimposed over input 
signal array elements located near the beginning and end of the array. The function PadSignal creates a 
padded copy of input signal array x1 by reflecting the edge elements of x1 and saving these elements along 
with the original input signal array elements in x2. Figure 11-3 shows an example of a padded input signal 
array that's compatible with a five-element convolution kernel. Note that n2, the size of the padded buffer, 
must equal n1 + ks2 * 2, where n1 represents the number of input signal array elements in x1 and ks2 
corresponds to floor(kernel_size / 2). 
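To make the reflection concrete, the padding logic can be restated with std::vector and applied to a tiny array. This is only an illustrative sketch: the name PadReflect and the vector-based interface are assumptions introduced here, while the index arithmetic mirrors the PadSignal function in Listing 11-1.

```cpp
#include <cassert>
#include <vector>

// Illustrative restatement of the PadSignal index arithmetic.
// NOTE: PadReflect and its std::vector interface are hypothetical and
// are not part of the book's source code.
std::vector<float> PadReflect(const std::vector<float>& x1, int ks2)
{
    const int n1 = static_cast<int>(x1.size());
    assert(ks2 >= 0 && ks2 <= n1);          // need enough elements to reflect

    std::vector<float> x2(n1 + ks2 * 2);    // n2 = n1 + ks2 * 2

    // Original signal goes in the middle of the padded buffer
    for (int i = 0; i < n1; i++)
        x2[i + ks2] = x1[i];

    // Leading and trailing pad points are mirror images of the edges
    for (int i = 0; i < ks2; i++)
    {
        x2[i] = x1[ks2 - i - 1];
        x2[n1 + ks2 + i] = x1[n1 - i - 1];
    }

    return x2;
}
```

With a five-element kernel (ks2 = 2), the input {1, 2, 3, 4, 5, 6} becomes {2, 1, 1, 2, 3, 4, 5, 6, 6, 5}: two reflected pad points, the six signal points, and two more reflected pad points.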


[Figure 11-3 panels: input signal array x1[0], x1[1], ..., x1[n1-2], x1[n1-1]; padded input signal array x2[0], x2[1], x2[2], x2[3], ..., x2[n2-4], x2[n2-3], x2[n2-2], x2[n2-1], consisting of pad points, signal points, and pad points]

Figure 11-3. Padded input signal array following execution of PadSignal using a five-element convolution 
kernel 

The C++ function Convolve1 contains code that exercises several different implementations of the 
discrete convolution algorithm. Near the top of this function is a single-precision floating-point array 
named kernel, which contains the convolution kernel coefficients. The coefficients in kernel represent a 
discrete approximation of a Gaussian (or low pass) filter. When convolved with an input signal array, these 
coefficients reduce the amount of noise that's present in a signal, as shown in the bottom plot of Figure 11-1. 
The padded input signal array x2 is created next using the previously described functions CreateSignal and 
PadSignal. Note that the C++ template class unique_ptr<> is used to manage storage space for both x2 and 
the unpadded array x1. 
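The five coefficients used in Convolve1 happen to equal the binomial coefficients 1, 4, 6, 4, 1 divided by 16, which is one common way to obtain a small Gaussian-like smoothing kernel. The text does not state how the coefficients were derived, so the following sketch should be read as one plausible construction; the function name BinomialKernel is an assumption.

```cpp
#include <cassert>
#include <vector>

// Builds a smoothing kernel from a row of Pascal's triangle, scaled so
// that the coefficients sum to 1.0.
// NOTE: BinomialKernel is a hypothetical helper; the book does not say
// that its kernel was generated this way.
std::vector<float> BinomialKernel(int ks)
{
    assert(ks >= 1 && ks <= 31);            // keeps the integer math in range

    const int n = ks - 1;
    const float scale = 1.0f / static_cast<float>(1LL << n);    // 1 / 2^n
    std::vector<float> kernel(ks);
    long long c = 1;                        // C(n, 0)

    for (int k = 0; k <= n; k++)
    {
        kernel[k] = static_cast<float>(c) * scale;
        c = c * (n - k) / (k + 1);          // C(n, k + 1) from C(n, k)
    }

    return kernel;
}
```

Calling BinomialKernel(5) reproduces the array { 0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f }. Because the coefficients sum to 1.0, the filter preserves the average level of the signal while attenuating rapid fluctuations.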

Following the generation of the input signal array x2, Convolve1 allocates storage space for four output 
signal arrays. The functions that implement different variations of the convolution algorithms are then 
called. The first two functions, Convolve1Cpp and Convolve1_, contain C++ and assembly language code 
that carry out their convolutions using the nested for loop technique described earlier in this section. 

The functions Convolve1Ks5Cpp and Convolve1Ks5_ are optimized for convolution kernels containing five 
elements. Real-world signal processing software regularly employs convolution functions that are optimized 
for specific kernel sizes since they're often significantly faster, as you will soon see. 

The function Convolve1Cpp begins its execution by validating argument values kernel_size and 
num_pts. The next statement, x += ks2, adjusts the input signal array pointer so that it points to the first 
true input signal array element. Recall that the input signal array x is padded with extra values to ensure 
correct processing when the convolution kernel is superimposed over the first two and last two input signal 
elements. Following the pointer x adjustment is the actual code that performs the convolution. The nested 
for loops implement the discrete convolution equation that was described earlier in this section. Note 
that the index value used for kernel is offset by ks2 to account for the negative indices of the inner loop. 
Following function Convolve1Cpp is the function Convolve1Ks5Cpp, which uses explicit C++ statements to 
calculate convolution sum-of-products instead of a for loop. 

The functions Convolve1_ and Convolve1Ks5_ are the assembly language counterparts of Convolve1Cpp 
and Convolve1Ks5Cpp, respectively. After its prolog, Convolve1_ validates argument values kernel_size and 
num_pts. This is followed by an initialization code block that begins with a mov r8d,r8d instruction, which 
zero-extends num_pts into R8. The ensuing shr r10d,1 instruction loads ks2 into register R10D. The final 
initialization code block instruction, lea rdx,[rdx+r10*4], loads register RDX with the address of the first 
input signal array element in x. 

Similar to the C++ code, Convolve1_ uses two nested for loops to perform the convolution. The outer 
loop, which is labeled LP1, starts with a vxorps xmm5,xmm5,xmm5 instruction that sets sum equal to 0.0. The 
ensuing mov r11,r10 and neg r11 instructions set inner loop index counter k (R11) to -ks2. The label LP2 
marks the start of the inner loop. The mov rbx,rax and sub rbx,r11 instructions calculate the index, or 
i - k, of the next element in x. This is followed by a vmovss xmm0,real4 ptr [rdx+rbx*4] instruction that 
loads x[i - k] into XMM0. Next, the mov rsi,r11 and add rsi,r10 instructions calculate k + ks2. The 
subsequent vfmadd231ss xmm5,xmm0,[r9+rsi*4] instruction calculates sum += x[i - k] * kernel[k 
+ ks2]. As discussed in Chapter 8, the FMA instruction vfmadd231ss carries out its operations using a 
single rounding operation. Depending on the algorithm, the use of a vfmadd231ss instruction instead of an 
equivalent sequence of vmulss and vaddss instructions may result in slightly faster execution times. 
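The numerical effect of a single rounding operation can be observed from C++ using the standard library function std::fma. The sketch below is an illustration only; the function names and the test values are assumptions chosen so that the product a * b is not exactly representable as a float. It demonstrates the rounding difference, not the execution time difference.

```cpp
#include <cassert>
#include <cmath>

// Computes a * b + c using two rounding operations (multiply, then add),
// which models a vmulss and vaddss instruction sequence.
// The volatile temporary keeps the compiler from contracting the
// expression into a fused operation.
// NOTE: MulThenAdd is a hypothetical name used for illustration.
float MulThenAdd(float a, float b, float c)
{
    volatile float prod = a * b;    // first rounding
    return prod + c;                // second rounding
}

// Computes a * b + c using a single rounding operation, which models
// a scalar FMA instruction such as vfmadd231ss.
// NOTE: FusedMulAdd is a hypothetical name used for illustration.
float FusedMulAdd(float a, float b, float c)
{
    return std::fma(a, b, c);
}
```

For example, with a = b = 1 + 2^-23 and c = -(1 + 2^-22), the exact value of a * b + c is 2^-46. The two-step version returns 0.0 because the low-order term is lost when the product is rounded, while the fused version returns 2^-46.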

Following the execution of the vfmadd231ss instruction, the add r11,1 instruction computes k++ and 
the inner loop repeats until k > ks2 is true. Subsequent to the completion of the inner loop, the vmovss 
real4 ptr [rcx+rax*4],xmm5 instruction saves the current sum-of-products result to y[i]. The add rax,1 
instruction updates index counter i, and the outer loop LP1 repeats until all of the input signal data points 
have been processed. 

The assembly language function Convolve1Ks5_ is size-optimized for convolution kernels that 
contain five elements. This function replaces the inner loop that was used in Convolve1_ with five explicit 
vfmadd231ss instructions. Note that this FMA instruction sequence employs two separate registers, XMM4 
and XMM5, for the intermediate sums. Most Intel processors that support AVX2 and FMA can execute two 
scalar FMA instructions simultaneously, which accelerates the algorithm's overall performance. Using a 
single sum register here would create a performance-degrading data dependency since each vfmadd231ss 
instruction would need to finish its operation before the next one could begin. You'll learn more about 
instruction-level data dependencies and FMA execution units in Chapter 15. Here is the output for source 
code example Ch11_01: 


Results for Convolve1 
  rc1 = true 
  rc2 = true 
  rc3 = true 
  rc4 = true 

Convolution results saved to file Ch11_01_Convolve1Results.csv 

Running benchmark function Convolve1_BM - please wait 
Benchmark times save to file Ch11_01_Convolve1_BM_CHROMIUM.csv 


Table 11-1 contains mean execution times for the convolution functions presented in this section. 
As alluded to earlier, the size-optimized convolution functions Convolve1Ks5Cpp and Convolve1Ks5_ are 
considerably faster than their size-independent counterparts. Note that the performance of the C++ function 
Convolve1Cpp is somewhat better than its assembly language equivalent Convolve1_. The reason for this 
is that the Visual C++ compiler generated code that partially unrolled the inner loop and replaced it with a 
series of sequential scalar single-precision floating-point multiply and add instructions. I could have easily 
implemented the same optimization technique in the function Convolve1_, but this improves performance 
only by a few percentage points. In order to achieve maximum FMA performance, an assembly language 
convolution function must use packed FMA instructions instead of scalar ones. You'll see an example of this 
in the next section. 


Table 11-1. Mean Execution Times (Microseconds) for Convolution Functions Using Five-Element 
Convolution Kernel (2,000,000 Signal Points) 


CPU Convolve1Cpp Convolve1_ Convolve1Ks5Cpp Convolve1Ks5_ 
i7-4790S 6148 6844 2926 2841 
i9-7900X 5607 6072 2808 2587 
i7-8700K 5149 5576 2539 2394 
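The two-register technique used by Convolve1Ks5_ has a direct C++ analogue. The sketch below assumes a hypothetical helper named Dot5TwoAccumulators; it splits the five fused multiply-add operations across two independent partial sums so that consecutive operations do not wait on each other's results. Whether a given compiler and processor actually execute the two chains in parallel is not guaranteed.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical C++ analogue of the Convolve1Ks5_ loop body: five fused
// multiply-adds distributed over two independent accumulators.
// ASSUMPTION: x[j - 4] through x[j] and kernel[0] through kernel[4]
// are valid elements.
float Dot5TwoAccumulators(const float* x, int j, const float* kernel)
{
    float sum_a = 0.0f;             // plays the role of XMM4
    float sum_b = 0.0f;             // plays the role of XMM5

    sum_a = std::fma(x[j],     kernel[0], sum_a);
    sum_b = std::fma(x[j - 1], kernel[1], sum_b);
    sum_a = std::fma(x[j - 2], kernel[2], sum_a);
    sum_b = std::fma(x[j - 3], kernel[3], sum_b);
    sum_a = std::fma(x[j - 4], kernel[4], sum_a);

    return sum_a + sum_b;           // combine the partial sums
}
```

Note that combining two partial sums changes the order of the additions, so the result can differ in the last bit from a single-accumulator version. For the test signal {1, 2, 3, 4, 5} with j = 4 and the five-element kernel from Listing 11-1, every intermediate value is exactly representable and the result is 3.0.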


Packed FMA 


It’s okay to use the convolution functions discussed in the previous section with small signal arrays. In 
many real-world applications, however, convolutions are often performed using signal arrays that contain 
thousands or millions of data points. For large signal arrays, the basic convolution algorithm can be adapted 
to use packed FMA instead of scalar FMA instructions to carry out the required calculations. Listing 11-2 
shows the source code for example Ch11_02, which illustrates how to implement the discrete convolution 
equation using packed FMA instructions. 

Listing 11-2. Example Ch11_02 


//------------------------------------------------
//               Ch11_02.cpp
//------------------------------------------------

#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <fstream>
#include <stdexcept>
#include "Ch11_02.h"
#include "AlignedMem.h"

using namespace std;

extern "C" const int c_NumPtsMin = 32;
extern "C" const int c_NumPtsMax = 16 * 1024 * 1024;
extern "C" const int c_KernelSizeMin = 3;
extern "C" const int c_KernelSizeMax = 15;
unsigned int g_RngSeedVal = 97;

void Convolve2(void)
{
    const int n1 = 512;
    const float kernel[] { 0.0625f, 0.25f, 0.375f, 0.25f, 0.0625f };
    const int ks = sizeof(kernel) / sizeof(float);
    const int ks2 = ks / 2;
    const int n2 = n1 + ks2 * 2;
    const unsigned int alignment = 32;

    // Create signal array
    AlignedArray<float> x1_aa(n1, alignment);
    AlignedArray<float> x2_aa(n2, alignment);
    float* x1 = x1_aa.Data();
    float* x2 = x2_aa.Data();

    CreateSignal(x1, n1, ks, g_RngSeedVal);
    PadSignal(x2, n2, x1, n1, ks2);

    // Perform convolutions
    AlignedArray<float> y5_aa(n1, alignment);
    AlignedArray<float> y6_aa(n1, alignment);
    AlignedArray<float> y7_aa(n1, alignment);
    float* y5 = y5_aa.Data();
    float* y6 = y6_aa.Data();
    float* y7 = y7_aa.Data();

    bool rc5 = Convolve2_(y5, x2, n1, kernel, ks);
    bool rc6 = Convolve2Ks5_(y6, x2, n1, kernel, ks);
    bool rc7 = Convolve2Ks5Test_(y7, x2, n1, kernel, ks);

    cout << "Results for Convolve2\n";
    cout << "  rc5 = " << boolalpha << rc5 << '\n';
    cout << "  rc6 = " << boolalpha << rc6 << '\n';
    cout << "  rc7 = " << boolalpha << rc7 << '\n';

    if (!rc5 || !rc6 || !rc7)
        return;

    // Save data
    const char* fn = "Ch11_02_Convolve2Results.csv";
    ofstream ofs(fn);

    if (ofs.bad())
        cout << "File create error - " << fn << '\n';
    else
    {
        const char* delim = ", ";

        ofs << fixed << setprecision(7);
        ofs << "i, x1, y5, y6, y7\n";

        for (int i = 0; i < n1; i++)
        {
            ofs << setw(5) << i << delim;
            ofs << setw(10) << x1[i] << delim;
            ofs << setw(10) << y5[i] << delim;
            ofs << setw(10) << y6[i] << delim;
            ofs << setw(10) << y7[i];

            // Flag any mismatch between the two Ks5 results
            if (y6[i] != y7[i])
                ofs << delim << '*';

            ofs << '\n';
        }

        ofs.close();
        cout << "\nResults data saved to file " << fn << '\n';
    }
}

int main()
{
    int ret_val = 1;

    try
    {
        Convolve2();
        Convolve2_BM();
        ret_val = 0;
    }

    catch (runtime_error& rte)
    {
        cout << "run_time exception has occurred\n";
        cout << rte.what() << '\n';
    }

    catch (...)
    {
        cout << "Unexpected exception has occurred\n";
    }

    return ret_val;
}
;-------------------------------------------------
;               Ch11_02_.asm
;-------------------------------------------------

        include <MacrosX86-64-AVX.asmh>
        extern c_NumPtsMin:dword
        extern c_NumPtsMax:dword
        extern c_KernelSizeMin:dword
        extern c_KernelSizeMax:dword

; extern bool Convolve2_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)

        .code
Convolve2_ proc frame
        _CreateFrame CV2_,0,0,rbx
        _EndProlog

; Validate argument values
        xor eax,eax                         ;set error code

        mov r10d,dword ptr [rbp+CV2_OffsetStackArgs]
        test r10d,1
        jz Done                             ;kernel_size is even
        cmp r10d,[c_KernelSizeMin]
        jl Done                             ;kernel_size too small
        cmp r10d,[c_KernelSizeMax]
        jg Done                             ;kernel_size too big

        cmp r8d,[c_NumPtsMin]
        jl Done                             ;num_pts too small
        cmp r8d,[c_NumPtsMax]
        jg Done                             ;num_pts too big
        test r8d,7
        jnz Done                            ;num_pts not even multiple of 8

        test rcx,1fh
        jnz Done                            ;y is not properly aligned

; Initialize convolution loop variables
        shr r10d,1                          ;r10 = kernel_size / 2 (ks2)
        lea rdx,[rdx+r10*4]                 ;rdx = x + ks2 (first data point)
        xor ebx,ebx                         ;i = 0

; Perform convolution
LP1:    vxorps ymm0,ymm0,ymm0               ;packed sum = 0.0;
        mov r11,r10                         ;r11 = ks2
        neg r11                             ;k = -ks2

LP2:    mov rax,rbx                             ;rax = i
        sub rax,r11                             ;rax = i - k
        vmovups ymm1,ymmword ptr [rdx+rax*4]    ;load x[i - k]:x[i - k + 7]

        mov rax,r11
        add rax,r10                             ;rax = k + ks2
        vbroadcastss ymm2,real4 ptr [r9+rax*4]  ;ymm2 = kernel[k + ks2]
        vfmadd231ps ymm0,ymm1,ymm2              ;ymm0 += x[i-k]:x[i-k+7] * kernel[k+ks2]

        add r11,1                               ;k += 1
        cmp r11,r10
        jle LP2                                 ;repeat until k > ks2

        vmovaps ymmword ptr [rcx+rbx*4],ymm0    ;save y[i]:y[i + 7]

        add rbx,8                               ;i += 8
        cmp rbx,r8
        jl LP1                                  ;repeat until done

        mov eax,1                               ;set success return code

Done:   vzeroupper
        _DeleteFrame rbx
        ret
Convolve2_ endp

; extern bool Convolve2Ks5_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)

Convolve2Ks5_ proc frame
        _CreateFrame CKS5_,0,48
        _SaveXmmRegs xmm6,xmm7,xmm8
        _EndProlog

; Validate argument values
        xor eax,eax                         ;set error code (rax is also loop index var)

        cmp dword ptr [rbp+CKS5_OffsetStackArgs],5
        jne Done                            ;jump if kernel_size is not 5

        cmp r8d,[c_NumPtsMin]
        jl Done                             ;jump if num_pts too small
        cmp r8d,[c_NumPtsMax]
        jg Done                             ;jump if num_pts too big
        test r8d,7
        jnz Done                            ;num_pts not even multiple of 8

        test rcx,1fh
        jnz Done                            ;y is not properly aligned

; Perform required initializations
        vbroadcastss ymm4,real4 ptr [r9]        ;kernel[0]
        vbroadcastss ymm5,real4 ptr [r9+4]      ;kernel[1]
        vbroadcastss ymm6,real4 ptr [r9+8]      ;kernel[2]
        vbroadcastss ymm7,real4 ptr [r9+12]     ;kernel[3]
        vbroadcastss ymm8,real4 ptr [r9+16]     ;kernel[4]
        mov r8d,r8d                             ;r8 = num_pts
        add rdx,8                               ;x += 2

; Perform convolution
@@:     vxorps ymm2,ymm2,ymm2                   ;initialize sum vars
        vxorps ymm3,ymm3,ymm3
        mov r11,rax
        add r11,2                               ;j = i + ks2

        vmovups ymm0,ymmword ptr [rdx+r11*4]    ;ymm0 = x[j]:x[j + 7]
        vfmadd231ps ymm2,ymm0,ymm4              ;ymm2 += x[j]:x[j + 7] * kernel[0]

        vmovups ymm1,ymmword ptr [rdx+r11*4-4]  ;ymm1 = x[j - 1]:x[j + 6]
        vfmadd231ps ymm3,ymm1,ymm5              ;ymm3 += x[j - 1]:x[j + 6] * kernel[1]

        vmovups ymm0,ymmword ptr [rdx+r11*4-8]  ;ymm0 = x[j - 2]:x[j + 5]
        vfmadd231ps ymm2,ymm0,ymm6              ;ymm2 += x[j - 2]:x[j + 5] * kernel[2]

        vmovups ymm1,ymmword ptr [rdx+r11*4-12] ;ymm1 = x[j - 3]:x[j + 4]
        vfmadd231ps ymm3,ymm1,ymm7              ;ymm3 += x[j - 3]:x[j + 4] * kernel[3]

        vmovups ymm0,ymmword ptr [rdx+r11*4-16] ;ymm0 = x[j - 4]:x[j + 3]
        vfmadd231ps ymm2,ymm0,ymm8              ;ymm2 += x[j - 4]:x[j + 3] * kernel[4]

        vaddps ymm0,ymm2,ymm3                   ;final values
        vmovaps ymmword ptr [rcx+rax*4],ymm0    ;save y[i]:y[i + 7]

        add rax,8                               ;i += 8
        cmp rax,r8
        jl @B                                   ;jump if i < num_pts

        mov eax,1                               ;set success return code

Done:   vzeroupper
        _RestoreXmmRegs xmm6,xmm7,xmm8
        _DeleteFrame
        ret
Convolve2Ks5_ endp
        end


The convolution functions in source code example Ch11_01 used single-precision floating-point signal
arrays and convolution kernels. Recall that a 256-bit wide YMM register can hold eight single-precision
floating-point values, which means that a SIMD implementation of the convolution algorithm can carry
out eight FMA calculations simultaneously. Figure 11-4 contains two graphics that illustrate a five-element
convolution kernel along with an arbitrary segment of an input signal array. Below the graphics are the
equations that can be used to convolve the eight input signal points f[i]:f[i+7] with a five-element
convolution kernel. These equations are a simple expansion of the discrete convolution equation that was
discussed earlier in this section. Note that each column of the SIMD convolution equation set includes a
single kernel value and eight consecutive elements from the input signal array. This means that a SIMD
convolution function can be easily implemented using data broadcast, packed move, and packed FMA
instructions, as you'll soon see.


Convolution kernel: g[-2], g[-1], g[0], g[1], g[2]

Input signal array (segment containing f[i-2] through f[i+9])


SIMD convolution equations (8 signal points)

h[i+0] = g[-2]f[i+2] + g[-1]f[i+1] + g[0]f[i+0] + g[1]f[i-1] + g[2]f[i-2]
h[i+1] = g[-2]f[i+3] + g[-1]f[i+2] + g[0]f[i+1] + g[1]f[i+0] + g[2]f[i-1]
h[i+2] = g[-2]f[i+4] + g[-1]f[i+3] + g[0]f[i+2] + g[1]f[i+1] + g[2]f[i+0]
h[i+3] = g[-2]f[i+5] + g[-1]f[i+4] + g[0]f[i+3] + g[1]f[i+2] + g[2]f[i+1]
h[i+4] = g[-2]f[i+6] + g[-1]f[i+5] + g[0]f[i+4] + g[1]f[i+3] + g[2]f[i+2]
h[i+5] = g[-2]f[i+7] + g[-1]f[i+6] + g[0]f[i+5] + g[1]f[i+4] + g[2]f[i+3]
h[i+6] = g[-2]f[i+8] + g[-1]f[i+7] + g[0]f[i+6] + g[1]f[i+5] + g[2]f[i+4]
h[i+7] = g[-2]f[i+9] + g[-1]f[i+8] + g[0]f[i+7] + g[1]f[i+6] + g[2]f[i+5]

Each column: consecutive input signal array elements


Figure 11-4. SIMD convolution equations for five-element convolution kernel
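
The column-at-a-time structure of the equations in Figure 11-4 can be sketched in portable C++. The following is a minimal scalar model, not code from the chapter download package: the function name ConvolveBlock8 and the use of std::vector are assumptions made for illustration, and the inner loop over eight lanes stands in for a single packed FMA instruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the book's source code). Computes the eight
// outputs h[i]:h[i+7] of Figure 11-4 one column at a time. x_padded is assumed
// to hold ks2 = kernel.size() / 2 pad elements before the first signal point,
// so signal element f[n] is stored at x_padded[n + ks2].
std::array<float, 8> ConvolveBlock8(const std::vector<float>& x_padded,
                                    const std::vector<float>& kernel,
                                    std::size_t i)
{
    const std::size_t ks2 = kernel.size() / 2;
    assert(kernel.size() % 2 == 1);
    assert(x_padded.size() >= i + 8 + 2 * ks2);

    std::array<float, 8> sum{};                     // eight packed sums = 0.0

    // Column m pairs kernel[m], i.e. g[m - ks2], with f[i+ks2-m]:f[i+ks2-m+7].
    for (std::size_t m = 0; m < kernel.size(); m++)
    {
        const float g = kernel[m];                  // value a broadcast would replicate
        const std::size_t first = i + 2 * ks2 - m;  // x_padded index of f[i + ks2 - m]

        for (std::size_t lane = 0; lane < 8; lane++)    // models one packed FMA
            sum[lane] += x_padded[first + lane] * g;
    }

    return sum;
}
```

Each pass of the outer loop touches one kernel value and eight consecutive input elements, which is the same pairing that the broadcast, packed move, and packed FMA instructions perform in the assembly language code.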


The C++ function Convolve2 is located near the top of Listing 11-2. This function creates and initializes
the padded input signal array x2 using the same CreateSignal and PadSignal functions that were employed
in example Ch11_01. It also uses the C++ template class AlignedArray to allocate storage for the output
signal arrays y5 and y6. In this example, the output signal arrays must be properly aligned since the assembly
language functions use the vmovaps instruction to save calculated results. Proper alignment of the input
signal arrays x1 and x2 is optional and used here for consistency. Following the allocation of the signal
arrays, Convolve2 invokes the assembly language functions Convolve2_ and Convolve2Ks5_; it then saves the
output signal array data to a CSV file.

The function Convolve2_ begins its execution by validating argument values kernel_size and
num_pts. It also verifies that output signal array y is properly aligned. The convolution code block of
Convolve2_ employs the same nested for loop construct that was used in example Ch11_01. Each outer
loop LP1 iteration begins with a vxorps ymm0,ymm0,ymm0 instruction that initializes eight single-precision
floating-point sum values to 0.0. The ensuing mov r11,r10 and neg r11 instructions initialize k with the
value -ks2. The inner loop starts by calculating and saving i - k in register RAX. The subsequent vmovups
ymm1,ymmword ptr [rdx+rax*4] instruction loads input signal array elements x[i - k]:x[i - k + 7]
into register YMM1. This is followed by a vbroadcastss ymm2,real4 ptr [r9+rax*4] instruction that
broadcasts kernel[k + ks2] to each single-precision floating-point element position in YMM2. The
vfmadd231ps ymm0,ymm1,ymm2 instruction multiplies each input signal array element in YMM1 by kernel[k
+ ks2] and adds this intermediate result to the packed sums in register YMM0. The inner loop repeats until
k > ks2 is true. Following the completion of the inner loop, the vmovaps ymmword ptr [rcx+rbx*4],ymm0
instruction saves the eight convolution results to output signal array elements y[i]:y[i + 7]. Outer loop
index counter i is then updated and the loop repeats until all input signal elements have been processed.

The assembly language function Convolve2Ks5_ is optimized for five-element convolution kernels.
Following the requisite argument validations, a series of vbroadcastss instructions load kernel coefficients
kernel[0]-kernel[4] into registers YMM4-YMM8, respectively. The two vxorps instructions located at
the top of the processing loop initialize the intermediate packed sums to 0.0. The array index j = i +
ks2 is then calculated and saved in register R11. The ensuing vmovups ymm0,ymmword ptr [rdx+r11*4]
instruction loads input signal array elements x[j]:x[j + 7] into register YMM0. This is followed by a
vfmadd231ps ymm2,ymm0,ymm4 instruction that multiplies each input signal array element in YMM0 by
kernel[0]; it then adds these values to the intermediate packed sums in YMM2. Convolve2Ks5_ then
uses four additional sets of vmovups and vfmadd231ps instructions to compute results using coefficients
kernel[1]-kernel[4]. Similar to the function Convolve1Ks5_ in source code example Ch11_01,
Convolve2Ks5_ also uses two YMM registers for its intermediate FMA sums, which facilitates simultaneous
execution of two vfmadd231ps instructions on processors with dual 256-bit wide FMA execution units.
Following the FMA operations, a vmovaps ymmword ptr [rcx+rax*4],ymm0 instruction saves the eight
output signal array elements to y[i]:y[i + 7].
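
The two-accumulator arrangement described above can be sketched in scalar C++ as follows. This is an illustrative model under stated assumptions, not the book's implementation: the name ConvolveKs5Block8 is hypothetical, the arrays sum_a and sum_b stand in for registers YMM2 and YMM3, and x_padded is assumed to begin with two pad elements so that x_padded[i + 4] corresponds to x[j] with j = i + ks2.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model of one Convolve2Ks5_ loop iteration.
// Even-numbered kernel terms accumulate in sum_a and odd-numbered terms in
// sum_b, so consecutive multiply-add steps never depend on each other.
std::array<float, 8> ConvolveKs5Block8(const std::vector<float>& x_padded,
                                       const std::vector<float>& kernel,
                                       std::size_t i)
{
    assert(kernel.size() == 5);
    assert(x_padded.size() >= i + 12);

    std::array<float, 8> sum_a{};       // plays the role of YMM2
    std::array<float, 8> sum_b{};       // plays the role of YMM3
    const std::size_t j = i + 4;        // x_padded index of x[i + ks2]

    for (std::size_t lane = 0; lane < 8; lane++)
    {
        sum_a[lane] += x_padded[j + lane]     * kernel[0];
        sum_b[lane] += x_padded[j + lane - 1] * kernel[1];
        sum_a[lane] += x_padded[j + lane - 2] * kernel[2];
        sum_b[lane] += x_padded[j + lane - 3] * kernel[3];
        sum_a[lane] += x_padded[j + lane - 4] * kernel[4];
    }

    std::array<float, 8> y{};
    for (std::size_t lane = 0; lane < 8; lane++)
        y[lane] = sum_a[lane] + sum_b[lane];    // final values (the vaddps step)

    return y;
}
```

A scalar compiler build will not necessarily benefit from the split sums; the point of the sketch is only to show which terms land in which accumulator before the final addition.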

The assembly language code for example Ch11_02 also includes a function named Convolve2Ks5Test_.
This function replaces all occurrences of the instruction vfmadd231ps with an equivalent sequence of
vmulps and vaddps instructions for benchmarking and value comparison purposes, which are discussed
shortly. The source code for Convolve2Ks5Test_ is not shown in Listing 11-2 but is included with the chapter
download package. Here are the results for source code example Ch11_02.


Results for Convolve2

  rc5 = true
  rc6 = true
  rc7 = true

Results data saved to file Ch11_02_Convolve2Results.csv

Running benchmark function Convolve2_BM - please wait
Benchmark times save to file Ch11_02_Convolve2_BM_CHROMIUM.csv


Table 11-2 shows the benchmark timing measurements for the SIMD implementations of the
convolution functions. As expected, the SIMD versions are significantly faster than their non-SIMD
counterparts (see Table 11-1). The mean execution times for the five-element convolution kernel functions
Convolve2Ks5_ and Convolve2Ks5Test_ are essentially the same.


Table 11-2. Mean Execution Times (Microseconds) for SIMD Convolution Functions Using Five-Element
Convolution Kernel (2,000,000 Signal Points)

CPU         Convolve2_    Convolve2Ks5_    Convolve2Ks5Test_
i7-4790S    1244          1067             1074
i9-7900X    956           719              709
i7-8700K    859           595              597

It is not uncommon for small value discrepancies to occur between a function that uses FMA
instructions and an equivalent function that uses distinct multiply and add instructions. This is confirmed
by comparing the output results of functions Convolve2Ks5_ and Convolve2Ks5Test_. Table 11-3 shows
a few examples of value discrepancies from the output file Ch11_02_Convolve2Results.csv. In a real-world
application, the magnitudes of these value discrepancies would most likely be inconsequential. However,
the potential for value discrepancies is something that you should always keep in mind if you're developing
production code that includes both FMA and non-FMA versions of the same function, especially for
functions that perform numerous FMA operations.


Table 11-3. Examples of Value Discrepancies Using FMA and Non-FMA Instruction Sequences

Index    x[]           Convolve2Ks5_    Convolve2Ks5Test_
33       1.3856432     1.1940877        1.1940879
108      1.3655651     1.4466031        1.4466029
180      -2.8778596    -2.7348523       -2.7348526
277      -1.7654022    -2.0587211       -2.0587208
403      2.0683382     2.0299273        2.0299270
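
The source of such discrepancies is the number of rounding steps: a multiply followed by an add rounds twice, whereas an FMA rounds once. The following C++ sketch shows the effect using the standard library function std::fma. The function names and the test values are assumptions chosen for illustration (they are not taken from Ch11_02), and the volatile temporary is there only to keep a compiler from fusing the separate multiply and add.

```cpp
#include <cassert>
#include <cmath>

// Multiply then add: the product is rounded to single precision before the
// addition takes place. The volatile temporary prevents the compiler from
// contracting the two operations into one fused operation.
float MulThenAdd(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

// Fused multiply-add: a * b + c is evaluated with a single rounding step.
float FusedMulAdd(float a, float b, float c)
{
    return std::fma(a, b, c);
}

// Illustrative values: a * a = 1 + 2^-11 + 2^-24, which needs 25 significant
// bits and therefore rounds to 1 + 2^-11 when stored as a float.
void DemoFmaDiscrepancy()
{
    const float a = 1.0f + std::ldexp(1.0f, -12);
    const float c = -(1.0f + std::ldexp(1.0f, -11));

    assert(MulThenAdd(a, a, c) == 0.0f);                    // low-order bit lost
    assert(FusedMulAdd(a, a, c) == std::ldexp(1.0f, -24));  // low-order bit kept
}
```

The example is deliberately extreme; in a long convolution the individual differences are far smaller relative to the result, which is consistent with the last-digit variations shown in Table 11-3.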


General-Purpose Register Instructions 


As mentioned and discussed in Chapter 8, several general-purpose register instruction set extensions have
been added to the x86 platform in recent years (see Tables 8-2 and 8-5). In this section, you learn how to use
some of these instructions. The first source code example illustrates how to use the flagless multiplication
and shift instructions. A flagless instruction executes its operation without modifying any of the status
flags in RFLAGS, which can be faster than an equivalent flag-based instruction depending on the specific
use case. The second source code example demonstrates several advanced bit-manipulation instructions.
The source code examples of this section require a processor that supports the BMI1, BMI2, and LZCNT
instruction set extensions.


Flagless Multiplication and Shifts 


Listing 11-3 shows the source code for example Ch11_03. This example demonstrates how to use the flagless
unsigned integer multiplication instruction mulx. It also explains how to use the flagless shift instructions
sarx, shlx, and shrx.


Listing 11-3. Example Ch11_03


#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <iomanip>
#include <sstream>

using namespace std;

extern "C" uint64_t GprMulx_(uint32_t a, uint32_t b, uint64_t flags[2]);
extern "C" void GprShiftx_(uint32_t x, uint32_t count, uint32_t results[3], uint64_t flags[4]);

string ToString(uint64_t flags)
{
    ostringstream oss;

    oss << "OF=" << ((flags & (1ULL << 11)) ? '1' : '0') << ' ';
    oss << "SF=" << ((flags & (1ULL << 7)) ? '1' : '0') << ' ';
    oss << "ZF=" << ((flags & (1ULL << 6)) ? '1' : '0') << ' ';
    oss << "PF=" << ((flags & (1ULL << 2)) ? '1' : '0') << ' ';
    oss << "CF=" << ((flags & (1ULL << 0)) ? '1' : '0') << ' ';

    return oss.str();
}

void GprMulx(void)
{
    const int n = 3;
    uint32_t a[n] = {64, 3200, 100000000};
    uint32_t b[n] = {1001, 12, 250000000};

    cout << "\nResults for AvxGprMulx\n";

    for (int i = 0; i < n; i++)
    {
        uint64_t flags[2];
        uint64_t c = GprMulx_(a[i], b[i], flags);

        cout << "\nTest case " << i << '\n';
        cout << "  a: " << a[i] << "  b: " << b[i] << "  c: " << c << '\n';
        cout << setfill('0') << hex;
        cout << "  status flags before mulx: " << ToString(flags[0]) << '\n';
        cout << "  status flags after mulx:  " << ToString(flags[1]) << '\n';
        cout << setfill(' ') << dec;
    }
}

void GprShiftx(void)
{
    const int n = 4;
    uint32_t x[n] = { 0x00000008, 0x80000080, 0x00000040, 0xfffffc10 };
    uint32_t count[n] = { 2, 5, 3, 4 };

    cout << "\nResults for AvxGprShiftx\n";

    for (int i = 0; i < n; i++)
    {
        uint32_t results[3];
        uint64_t flags[4];

        GprShiftx_(x[i], count[i], results, flags);

        cout << setfill(' ') << dec;
        cout << "\nTest case " << i << '\n';

        cout << setfill('0') << hex << "  x:    0x" << setw(8) << x[i] << " (";
        cout << setfill(' ') << dec << x[i] << ") count: " << count[i] << '\n';

        cout << setfill('0') << hex << "  sarx: 0x" << setw(8) << results[0] << " (";
        cout << setfill(' ') << dec << results[0] << ")\n";

        cout << setfill('0') << hex << "  shlx: 0x" << setw(8) << results[1] << " (";
        cout << setfill(' ') << dec << results[1] << ")\n";

        cout << setfill('0') << hex << "  shrx: 0x" << setw(8) << results[2] << " (";
        cout << setfill(' ') << dec << results[2] << ")\n";

        cout << "  status flags before shifts: " << ToString(flags[0]) << '\n';
        cout << "  status flags after sarx:    " << ToString(flags[1]) << '\n';
        cout << "  status flags after shlx:    " << ToString(flags[2]) << '\n';
        cout << "  status flags after shrx:    " << ToString(flags[3]) << '\n';
    }
}

int main()
{
    GprMulx();
    GprShiftx();
    return 0;
}

; extern "C" uint64_t GprMulx_(uint32_t a, uint32_t b, uint64_t flags[2]);
; Requires BMI2

        .code
GprMulx_ proc

; Save copy of status flags before mulx
        pushfq
        pop rax
        mov qword ptr [r8],rax              ;save original status flags

; Perform flagless multiplication. The mulx instruction below computes
; the product of explicit source operand ecx (a) and implicit source
; operand edx (b). The 64-bit result is saved to the register pair r11d:r10d.
        mulx r11d,r10d,ecx                  ;r11d:r10d = a * b

; Save copy of status flags after mulx
        pushfq
        pop rax
        mov qword ptr [r8+8],rax            ;save post mulx status flags

; Move 64-bit result to rax
        mov eax,r10d
        shl r11,32
        or rax,r11
        ret
GprMulx_ endp

; extern "C" void GprShiftx_(uint32_t x, uint32_t count, uint32_t results[3], uint64_t flags[4])
;
; Requires BMI2

GprShiftx_ proc

; Save copy of status flags before shifts
        pushfq
        pop rax
        mov qword ptr [r9],rax              ;save original status flags

; Load argument values and perform shifts. Note that each shift
; instruction requires three operands: DesOp, SrcOp, and CountOp.
        sarx eax,ecx,edx                    ;shift arithmetic right
        mov dword ptr [r8],eax
        pushfq
        pop rax
        mov qword ptr [r9+8],rax

        shlx eax,ecx,edx                    ;shift logical left
        mov dword ptr [r8+4],eax
        pushfq
        pop rax
        mov qword ptr [r9+16],rax

        shrx eax,ecx,edx                    ;shift logical right
        mov dword ptr [r8+8],eax
        pushfq
        pop rax
        mov qword ptr [r9+24],rax

        ret
GprShiftx_ endp
        end


The C++ code for example Ch11_03 contains two functions named GprMulx and GprShiftx. These
functions initialize test cases that demonstrate flagless multiplication and shift operations, respectively. Note
that both GprMulx and GprShiftx define an array of type uint64_t named flags. This array is used to show
the contents of the status flags in RFLAGS before and after the execution of each flagless instruction. The
remaining code in both GprMulx and GprShiftx formats and streams results to cout.

The assembly language function GprMulx_ begins its execution by saving a copy of RFLAGS. The
ensuing mulx r11d,r10d,ecx instruction performs a 32-bit unsigned integer multiplication using implicit
source operand EDX (argument value b) with explicit source operand ECX (argument value a). The 64-bit
product is then saved in register pair R11D:R10D. Following the execution of the mulx instruction, the
contents of RFLAGS are saved again for comparison purposes. The mulx instruction also supports flagless
multiplications using 64-bit operands. When used with 64-bit operands, register RDX is employed as the
implicit operand and two 64-bit general-purpose registers must be used for the destination operands.
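
The arithmetic performed by the 32-bit form of mulx, together with the register-pair recombination used at the end of the assembly language function, can be modeled in portable C++. This is a minimal sketch for illustration only; the names Product32, Mulx32, and Combine are hypothetical and do not appear in the book's source code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type that models a hi:lo register pair such as R11D:R10D.
struct Product32
{
    std::uint32_t hi;
    std::uint32_t lo;
};

// Portable model of the 32-bit form of mulx: hi:lo = a * b (unsigned).
// Unlike the instruction, C++ has no status flags to preserve.
Product32 Mulx32(std::uint32_t a, std::uint32_t b)
{
    const std::uint64_t p = static_cast<std::uint64_t>(a) * b;
    return { static_cast<std::uint32_t>(p >> 32), static_cast<std::uint32_t>(p) };
}

// Mirrors the mov eax,r10d / shl r11,32 / or rax,r11 sequence that builds
// the 64-bit return value from the two 32-bit halves.
std::uint64_t Combine(Product32 p)
{
    return (static_cast<std::uint64_t>(p.hi) << 32) | p.lo;
}
```

Calling Combine(Mulx32(100000000, 250000000)) reproduces the 64-bit product of the third test case, whose upper half is nonzero.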

The function GprShiftx_ includes examples of the sarx, shlx, and shrx instructions using 32-bit
wide operands. These instructions use a three-operand syntax similar to AVX instructions. The first source
operand is shifted by the count value that's specified in the second source operand. The result is then saved
to the destination operand. The flagless shift instructions can also be used with 64-bit wide operands; 8- and
16-bit wide operands are not supported. Here is the output for source code example Ch11_03:


Results for AvxGprMulx

Test case 0
  a: 64  b: 1001  c: 64064
  status flags before mulx: OF=0 SF=0 ZF=1 PF=1 CF=0
  status flags after mulx:  OF=0 SF=0 ZF=1 PF=1 CF=0

Test case 1
  a: 3200  b: 12  c: 38400
  status flags before mulx: OF=0 SF=1 ZF=0 PF=0 CF=1
  status flags after mulx:  OF=0 SF=1 ZF=0 PF=0 CF=1

Test case 2
  a: 100000000  b: 250000000  c: 25000000000000000
  status flags before mulx: OF=0 SF=1 ZF=0 PF=1 CF=1
  status flags after mulx:  OF=0 SF=1 ZF=0 PF=1 CF=1

Results for AvxGprShiftx

Test case 0
  x:    0x00000008 (8) count: 2
  sarx: 0x00000002 (2)
  shlx: 0x00000020 (32)
  shrx: 0x00000002 (2)
  status flags before shifts: OF=0 SF=0 ZF=0 PF=1 CF=0
  status flags after sarx:    OF=0 SF=0 ZF=0 PF=1 CF=0
  status flags after shlx:    OF=0 SF=0 ZF=0 PF=1 CF=0
  status flags after shrx:    OF=0 SF=0 ZF=0 PF=1 CF=0

Test case 1
  x:    0x80000080 (2147483776) count: 5
  sarx: 0xfc000004 (4227858436)
  shlx: 0x00001000 (4096)
  shrx: 0x04000004 (67108868)
  status flags before shifts: OF=0 SF=1 ZF=0 PF=0 CF=1
  status flags after sarx:    OF=0 SF=1 ZF=0 PF=0 CF=1
  status flags after shlx:    OF=0 SF=1 ZF=0 PF=0 CF=1
  status flags after shrx:    OF=0 SF=1 ZF=0 PF=0 CF=1

Test case 2
  x:    0x00000040 (64) count: 3
  sarx: 0x00000008 (8)
  shlx: 0x00000200 (512)
  shrx: 0x00000008 (8)
  status flags before shifts: OF=0 SF=1 ZF=0 PF=0 CF=1
  status flags after sarx:    OF=0 SF=1 ZF=0 PF=0 CF=1
  status flags after shlx:    OF=0 SF=1 ZF=0 PF=0 CF=1
  status flags after shrx:    OF=0 SF=1 ZF=0 PF=0 CF=1

Test case 3
  x:    0xfffffc10 (4294966288) count: 4
  sarx: 0xffffffc1 (4294967233)
  shlx: 0xffffc100 (4294951168)
  shrx: 0x0fffffc1 (268435393)
  status flags before shifts: OF=0 SF=1 ZF=0 PF=1 CF=1
  status flags after sarx:    OF=0 SF=1 ZF=0 PF=1 CF=1
  status flags after shlx:    OF=0 SF=1 ZF=0 PF=1 CF=1
  status flags after shrx:    OF=0 SF=1 ZF=0 PF=1 CF=1
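
The shift results in the preceding output can be cross-checked with a portable C++ model of the 32-bit forms of sarx, shlx, and shrx. The function names below are hypothetical, and the sketch assumes only that the instructions use the low five bits of the count operand for 32-bit operands.

```cpp
#include <cassert>
#include <cstdint>

// Portable models of the 32-bit flagless shifts. Only the low five bits of
// the count are used, matching the count masking applied to 32-bit operands.
std::uint32_t Shlx32(std::uint32_t x, std::uint32_t count)
{
    return x << (count & 31);           // shift logical left
}

std::uint32_t Shrx32(std::uint32_t x, std::uint32_t count)
{
    return x >> (count & 31);           // shift logical right (zero fill)
}

std::uint32_t Sarx32(std::uint32_t x, std::uint32_t count)
{
    const std::uint32_t c = count & 31;
    const std::uint32_t logical = x >> c;

    // Replicate the sign bit into the c vacated high-order bit positions.
    const std::uint32_t fill = (x & 0x80000000u) ? ~(0xffffffffu >> c) : 0u;
    return logical | fill;              // shift arithmetic right
}
```

For example, Sarx32(0x80000080, 5) yields 0xfc000004 while Shrx32 with the same arguments yields 0x04000004, which matches test case 1.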
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Enhanced Bit Manipulation 


Most of the instructions included in the BMI1 and BMI2 instruction set extensions are geared toward
improving the performance of specific algorithms, such as data encryption and decryption. They also can
be employed to simplify bit-manipulation operations in more mundane algorithms. Source code example
Ch11_04 includes three simple assembly language functions that demonstrate use of the enhanced
bit-manipulation instructions lzcnt, tzcnt, bextr, and andn. Listing 11-4 shows the C++ and assembly language
source code for this example.


Listing 11-4. Example Ch11_04 


#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <iomanip>

using namespace std;

extern "C" void GprCountZeroBits_(uint32_t x, uint32_t* lzcnt, uint32_t* tzcnt);
extern "C" uint32_t GprBextr_(uint32_t x, uint8_t start, uint8_t length);
extern "C" uint32_t GprAndNot_(uint32_t x, uint32_t y);

void GprCountZeroBits(void)
{
    const int n = 5;
    uint32_t x[n] = { 0x01000008, 0x00008000, 0x08000000, 0x00000001, 0 };

    cout << "\nResults for AvxGprCountZeroBits\n";

    for (int i = 0; i < n; i++)
    {
        uint32_t lzcnt, tzcnt;

        GprCountZeroBits_(x[i], &lzcnt, &tzcnt);

        cout << setfill('0') << hex;
        cout << "x: 0x" << setw(8) << x[i] << "  ";
        cout << setfill(' ') << dec;
        cout << "lzcnt: " << setw(3) << lzcnt << "  ";
        cout << "tzcnt: " << setw(3) << tzcnt << '\n';
    }
}

void GprExtractBitField(void)
{
    const int n = 3;
    uint32_t x[n] = { 0x12345678, 0x80808080, 0xfedcba98 };
    uint8_t start[n] = { 4, 7, 24 };
    uint8_t len[n] = { 16, 9, 8 };

    cout << "\nResults for GprExtractBitField\n";

    for (int i = 0; i < n; i++)
    {
        uint32_t bextr = GprBextr_(x[i], start[i], len[i]);

        cout << setfill('0') << hex;
        cout << "x: 0x" << setw(8) << x[i] << "  ";
        cout << setfill(' ') << dec;
        cout << "start: " << setw(3) << (uint32_t)start[i] << "  ";
        cout << "len:  " << setw(3) << (uint32_t)len[i] << "  ";
        cout << setfill('0') << hex;
        cout << "bextr: 0x" << setw(8) << bextr << '\n';
    }
}

void GprAndNot(void)
{
    const int n = 3;
    uint32_t x[n] = { 0xf000000f, 0xff00ff00, 0xaaaaaaaa };
    uint32_t y[n] = { 0x12345678, 0x12345678, 0xffaa5500 };

    cout << "\nResults for GprAndNot\n";

    for (int i = 0; i < n; i++)
    {
        uint32_t andn = GprAndNot_(x[i], y[i]);

        cout << setfill('0') << hex;
        cout << "x: 0x" << setw(8) << x[i] << "  ";
        cout << "y: 0x" << setw(8) << y[i] << "  ";
        cout << "andn: 0x" << setw(8) << andn << '\n';
    }
}

int main()
{
    GprCountZeroBits();
    GprExtractBitField();
    GprAndNot();
    return 0;
}



; extern "C" void GprCountZeroBits_(uint32_t x, uint32_t* lzcnt, uint32_t* tzcnt);
;
; Requires: BMI1, LZCNT

        .code
GprCountZeroBits_ proc
        lzcnt eax,ecx                       ;count leading zeros
        mov dword ptr [rdx],eax             ;save result

        tzcnt eax,ecx                       ;count trailing zeros
        mov dword ptr [r8],eax              ;save result
        ret
GprCountZeroBits_ endp

; extern "C" uint32_t GprBextr_(uint32_t x, uint8_t start, uint8_t length);
;
; Requires: BMI1

GprBextr_ proc
        mov al,r8b
        mov ah,al                           ;ah = length
        mov al,dl                           ;al = start
        bextr eax,ecx,eax                   ;eax = extracted bit field (from x)
        ret
GprBextr_ endp

; extern "C" uint32_t GprAndNot_(uint32_t x, uint32_t y);
;
; Requires: BMI1

GprAndNot_ proc
        andn eax,ecx,edx                    ;eax = ~x & y
        ret
GprAndNot_ endp
        end


The C++ code in Listing 11-4 contains three short functions that set up test cases for the assembly
language functions. The first function, GprCountZeroBits, initializes a test array that's used to demonstrate
the lzcnt (Count the Number of Leading Zero Bits) and tzcnt (Count the Number of Trailing Zero Bits)
instructions. The second function, GprExtractBitField, prepares test data for the bextr (Bit Field Extract)
instruction. The final C++ function in Listing 11-4 is named GprAndNot. This function loads test arrays with
data that's used to illuminate execution of the andn (Bitwise AND NOT) instruction.

The first assembly language function is named GprCountZeroBits_. This function uses the lzcnt and
tzcnt instructions to count the number of leading and trailing zero bits in their respective 32-bit wide source
operands. The calculated bit counts are then saved in the specified destination operand. The next function,
GprBextr_, exercises the bextr instruction. This instruction's first source operand contains the data from
which the bit field will be extracted. Bits 7:0 and 15:8 of the second source operand specify the field's starting
bit position and length, respectively. Finally, the function GprAndNot_ shows how to use the andn instruction.
This instruction computes DesOp = ~SrcOp1 & SrcOp2 and is often employed to simplify Boolean masking
operations. Here are the results for source code example Ch11_04.


Results for AvxGprCountZeroBits

x: 0x01000008  lzcnt:   7  tzcnt:   3
x: 0x00008000  lzcnt:  16  tzcnt:  15
x: 0x08000000  lzcnt:   4  tzcnt:  27
x: 0x00000001  lzcnt:  31  tzcnt:   0
x: 0x00000000  lzcnt:  32  tzcnt:  32

Results for GprExtractBitField

x: 0x12345678  start:   4  len:   16  bextr: 0x00004567
x: 0x80808080  start:   7  len:    9  bextr: 0x00000101
x: 0xfedcba98  start:  24  len:    8  bextr: 0x000000fe

Results for GprAndNot

x: 0xf000000f  y: 0x12345678  andn: 0x02345670
x: 0xff00ff00  y: 0x12345678  andn: 0x00340078
x: 0xaaaaaaaa  y: 0xffaa5500  andn: 0x55005500
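
The results shown above can be reproduced with portable C++ models of the four instructions. The following sketch is meant only as a reference for the operand semantics described in this section; all function names are hypothetical, and the loops are written for clarity rather than speed.

```cpp
#include <cassert>
#include <cstdint>

// Counts leading zero bits; returns 32 when x is zero (as lzcnt does).
std::uint32_t Lzcnt32(std::uint32_t x)
{
    std::uint32_t n = 0;
    for (std::uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1)
        n++;
    return n;
}

// Counts trailing zero bits; returns 32 when x is zero (as tzcnt does).
std::uint32_t Tzcnt32(std::uint32_t x)
{
    std::uint32_t n = 0;
    for (std::uint32_t mask = 1u; mask != 0 && (x & mask) == 0; mask <<= 1)
        n++;
    return n;
}

// Packs start (bits 7:0) and length (bits 15:8) the way GprBextr_ packs
// them into AL and AH before executing bextr.
std::uint32_t MakeBextrControl(std::uint8_t start, std::uint8_t length)
{
    return static_cast<std::uint32_t>(start) | (static_cast<std::uint32_t>(length) << 8);
}

// Extracts 'length' bits of x beginning at bit position 'start'.
std::uint32_t Bextr32(std::uint32_t x, std::uint32_t control)
{
    const std::uint32_t start = control & 0xffu;
    const std::uint32_t length = (control >> 8) & 0xffu;

    if (start >= 32 || length == 0)
        return 0;

    const std::uint32_t shifted = x >> start;
    return (length >= 32) ? shifted : (shifted & ((1u << length) - 1u));
}

// DesOp = ~SrcOp1 & SrcOp2
std::uint32_t AndNot32(std::uint32_t x, std::uint32_t y)
{
    return ~x & y;
}
```

For instance, Bextr32(0x12345678, MakeBextrControl(4, 16)) returns 0x00004567, matching the first bit field extraction test case.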


The BMI1 and BMI2 instruction set extensions also include other enhanced bit-manipulation
instructions that can be used to implement specific algorithms or carry out specialized operations. These
instructions are listed in Table 8-5.


Half-Precision Floating-Point Conversions 


The final source code example of this chapter, Ch11_05, exemplifies use of the half-precision conversion
instructions vcvtps2ph and vcvtph2ps. Listing 11-5 shows the source code for this example. If your
understanding of the half-precision floating-point data type is incomplete, you may want to review the
content in Chapter 8 before perusing this section's source code and elucidations.


Listing 11-5. Example Ch11_05 


#include "stdafx.h"
#include <cstdint>
#include <string>
#include <iostream>
#include <iomanip>

using namespace std;

extern "C" void SingleToHalfPrecision_(uint16_t x_hp[8], float x_sp[8], int rc);
extern "C" void HalfToSinglePrecision_(float x_sp[8], uint16_t x_hp[8]);

int main()
{
    float x[8];

    x[0] = 4.125f;
    x[1] = 32.9f;
    x[2] = 56.3333f;
    x[3] = -68.6667f;
    x[4] = 42000.5f;
    x[5] = 75600.0f;
    x[6] = -6002.125f;
    x[7] = 170.0625f;

    uint16_t x_hp[8];
    float rn[8], rd[8], ru[8], rz[8];

    SingleToHalfPrecision_(x_hp, x, 0);
    HalfToSinglePrecision_(rn, x_hp);
    SingleToHalfPrecision_(x_hp, x, 1);
    HalfToSinglePrecision_(rd, x_hp);
    SingleToHalfPrecision_(x_hp, x, 2);
    HalfToSinglePrecision_(ru, x_hp);
    SingleToHalfPrecision_(x_hp, x, 3);
    HalfToSinglePrecision_(rz, x_hp);

    unsigned int w = 15;
    string line(76, '-');

    cout << fixed << setprecision(4);
    cout << setw(w) << "x";
    cout << setw(w) << "RoundNearest";
    cout << setw(w) << "RoundDown";
    cout << setw(w) << "RoundUp";
    cout << setw(w) << "RoundZero";
    cout << '\n' << line << '\n';

    for (int i = 0; i < 8; i++)
    {
        cout << setw(w) << x[i];
        cout << setw(w) << rn[i];
        cout << setw(w) << rd[i];
        cout << setw(w) << ru[i];
        cout << setw(w) << rz[i];
        cout << '\n';
    }

    return 0;
}

; extern "C" void SingleToHalfPrecision_(uint16_t x_hp[8], float x_sp[8], int rc);


        .code
SingleToHalfPrecision_ proc

; Convert packed single-precision to packed half-precision
        vmovups ymm0,ymmword ptr [rdx]      ;ymm0 = 8 SPFP values

        cmp r8d,0
        jne @F
        vcvtps2ph xmm1,ymm0,0               ;round to nearest
        jmp SaveResult

@@:     cmp r8d,1
        jne @F
        vcvtps2ph xmm1,ymm0,1               ;round down
        jmp SaveResult

@@:     cmp r8d,2
        jne @F
        vcvtps2ph xmm1,ymm0,2               ;round up
        jmp SaveResult

@@:     cmp r8d,3
        jne @F
        vcvtps2ph xmm1,ymm0,3               ;truncate
        jmp SaveResult

@@:     vcvtps2ph xmm1,ymm0,4               ;use MXCSR.RC

SaveResult:
        vmovdqu xmmword ptr [rcx],xmm1      ;save 8 HPFP values
        vzeroupper
        ret
SingleToHalfPrecision_ endp

; extern "C" void HalfToSinglePrecision_(float x_sp[8], uint16_t x_hp[8]);

HalfToSinglePrecision_ proc

; Convert packed half-precision to packed single-precision
        vcvtph2ps ymm0,xmmword ptr [rdx]
        vmovups ymmword ptr [rcx],ymm0      ;save 8 SPFP values
        vzeroupper
        ret
HalfToSinglePrecision_ endp
        end


The C++ function main starts its execution by loading single-precision floating-point test values into
array x. It then exercises the half-precision floating-point conversion functions SingleToHalfPrecision_
and HalfToSinglePrecision_. Note that the function SingleToHalfPrecision_ requires a third argument,
which specifies the rounding mode to use when converting floating-point values from single-precision to
half-precision. Also note that an array type of uint16_t is used to store the half-precision floating-point
results since C++ does not natively support a half-precision floating-point data type.
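
Because the half-precision results are held in uint16_t elements, it can be helpful to decode one by hand. The sketch below is an illustrative software decoder, not part of example Ch11_05; the name HalfToFloat is hypothetical, and the code assumes the IEEE 754 binary16 layout of 1 sign bit, 5 exponent bits (bias 15), and 10 fraction bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical software decoder for one half-precision value stored in a
// uint16_t. Every half-precision value is exactly representable as a float,
// so no rounding occurs in this direction.
float HalfToFloat(std::uint16_t h)
{
    const bool negative = (h & 0x8000u) != 0;
    const int exponent = (h >> 10) & 0x1f;      // 5-bit biased exponent
    const int fraction = h & 0x3ff;             // 10-bit fraction

    float value;

    if (exponent == 0)                          // zero or subnormal
        value = std::ldexp(static_cast<float>(fraction), -24);
    else if (exponent == 31)                    // infinity or NaN
        value = (fraction == 0) ? std::numeric_limits<float>::infinity()
                                : std::numeric_limits<float>::quiet_NaN();
    else                                        // normal: 1.fraction * 2^(exponent - 15)
        value = std::ldexp(static_cast<float>(fraction + 1024), exponent - 25);

    return negative ? -value : value;
}
```

Decoding 0x7bff gives 65504.0, the largest finite half-precision value, and 0x7c00 decodes to infinity; any single-precision value that rounds beyond 65504.0 cannot be represented as a finite half-precision number.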

The assembly language function SingleToHalfPrecision_ uses the vcvtps2ph instruction to
convert the eight single-precision floating-point values in array x_sp to half-precision floating-point.
This instruction requires an immediate operand that specifies the rounding mode to use during the type
conversion. Table 11-4 shows the rounding mode options for the vcvtps2ph instruction.


Table 11-4. Rounding Mode Options for the vcvtps2ph Instruction

Immediate Operand Bits    Value      Description
1:0                       00b        Round to nearest
                          01b        Round down (toward -∞)
                          10b        Round up (toward +∞)
                          11b        Round toward zero (truncate)
2                         0          Use rounding mode specified in immediate operand bits 1:0
                          1          Use rounding mode specified in MXCSR.RC
7:3                       Ignored    Not used


The function HalfToSinglePrecision_ uses the vcvtph2ps instruction to convert eight half-precision
floating-point values to single-precision floating-point. The output for source code example Ch11_05 follows
this paragraph. Note the value differences between the various rounding modes. Also note that when the
rounding mode RoundNearest or RoundUp is used, the value 75600.0f is converted to inf (or infinity) since
this quantity exceeds the largest possible value for a half-precision floating-point value.


              x   RoundNearest      RoundDown        RoundUp      RoundZero
----------------------------------------------------------------------------
         4.1250         4.1250         4.1250         4.1250         4.1250
        32.9000        32.9063        32.8750        32.9063        32.8750
        56.3333        56.3438        56.3125        56.3438        56.3125
       -68.6667       -68.6875       -68.6875       -68.6250       -68.6250
     42000.5000     42016.0000     41984.0000     42016.0000     41984.0000
     75600.0000            inf     65504.0000            inf     65504.0000
     -6002.1250     -6004.0000     -6004.0000     -6000.0000     -6000.0000
       170.0625       170.0000       170.0000       170.1250       170.0000


Summary


Here are the key learning points from Chapter 11: 


FMA instructions are often employed to implement numerically-oriented algorithms 
such as discrete convolutions, which are used extensively in a wide variety of 
problem domains, including signal processing and image processing. 


Assembly language functions that employ sequences of consecutive FMA 
instructions should use multiple XMM or YMM registers to store intermediate sum 
products. Using multiple registers helps avoid data dependencies that can preclude 
the processor from executing multiple FMA instructions simultaneously. 


Value discrepancies normally occur between functions that implement the exact 
same algorithm or operation using FMA and non-FMA instruction sequences. The 
significance of these discrepancies depends on the application. 


Assembly language functions can use the mulx, sarx, shlx, and shrx instructions to 
carry out flagless unsigned integer multiplication and shifts. These instructions may 
yield slightly faster performance than their flag-based counterparts in algorithms 
that perform consecutive multiplication and shift operations. 


Assembly language functions can use the lzcnt, tzcnt, bextr, and andn instructions
to perform advanced bit-manipulation operations.


Assembly language functions can use the vcvtps2ph and vcvtph2ps instructions
to perform conversions between single-precision and half-precision floating-point
values.


Advanced Vector Extensions 512


In the previous eight chapters, you learned about the scalar floating-point, packed floating-point, and 
packed integer capabilities of AVX and AVX2. In this chapter, you'll learn about Advanced Vector Extensions
512 (AVX-512). AVX-512 is undoubtedly the largest and perhaps the most consequential extension of the x86 
platform to date. It doubles the number of available SIMD registers and broadens the width of each register 
from 256 to 512 bits. AVX-512 also extends the instruction syntax of AVX and AVX2 to support additional 
capabilities not available in the earlier extensions, including conditional execution and merging, embedded 
broadcasts, and instruction-level rounding control for floating-point operations. 

The content of this chapter is organized as follows. The first section presents a brief overview of 
AVX-512, which includes information about AVX-512's various instruction set extensions. This is followed 
by an examination of the AVX-512 execution environment, including its register sets, data types, instruction 
syntaxes, and enhanced computational features. The chapter concludes with a synopsis of the AVX-512 
instruction set extensions that are included in recently marketed processors for server and workstation 
platforms. 


AVX-512 Overview 


Unlike AVX and AVX2, AVX-512 is not a distinct instruction set extension. Rather, it's a congruous collection 
of interrelated instruction set extensions. An x86 processor is AVX-512 conforming if it supports the AVX512F 
(or foundation) instruction set extension. An AVX-512 conforming processor may optionally support 
additional AVX-512 instruction set extensions and these vary according to the processor's target market 
segment (e.g., high-performance computing, server, desktop, mobile, etc.). Table 12-1 lists the AVX-512 
instruction set extensions that are currently available in some Intel processors. This table also includes the 
AVX-512 instruction set extensions that Intel has announced for inclusion in future processors. As of the 
writing of this text, AMD does not market any processors that support AVX-512. 

The discussions in this chapter and the source code examples of Chapters 13 and 14 primarily focus 
on the AVX-512 instruction set extensions that are incorporated in Intel's Skylake Server microarchitecture, 
which was launched during 2017. This microarchitecture is used in Intel's Xeon Scalable (servers), Xeon 
W (workstations), and Core i7-7800X and i9-7900X series (high-end desktop) CPUs. Processors based on 
the Skylake Server microarchitecture contain the following AVX-512 instruction set extensions: AVX512F, 
AVX512CD, AVX512BW, AVX512DQ, and AVX512VL. Future mainstream processors from both AMD and 
Intel are expected to include these same AVX-512 extensions. Chapter 16 explains how to use the cpuid
instruction to detect the AVX-512 instruction set extensions that are shown in Table 12-1.
Table 12-1. Overview of AVX-512 Instruction Set Extensions

CPUID Flag          Description
AVX512F             Foundation instructions
AVX512ER            Exponential and reciprocal instructions
AVX512PF            Prefetch instructions
AVX512CD            Conflict detect instructions
AVX512DQ            Doubleword and quadword instructions
AVX512BW            Byte and word instructions
AVX512VL            128-bit and 256-bit vector instructions
AVX512_IFMA         Integer fused-multiply-add
AVX512_VBMI         Additional vector byte instructions
AVX512_4FMAPS       Packed single-precision FMA (4 iterations)
AVX512_4VNNI        Vector neural network instructions (4 iterations)
AVX512_VPOPCNTDQ    vpopcnt[d|q] instructions
AVX512_VNNI         Vector neural net instructions
AVX512_VBMI2        New vector byte, word, doubleword, and quadword instructions
AVX512_BITALG       vpopcnt[b|w] and vpshufbitqmb instructions


AVX-512 Execution Environment 


AVX-512 augments the execution environment of the x86 platform with the addition of new registers and 
data types. It also extends the assembly language instruction syntax of AVX and AVX2 to support enhanced 
operations such as conditional executions and merging, embedded broadcasts, and instruction-level
rounding control. This section discusses these enhancements in greater detail. 


Register Sets 


Figure 12-1 illustrates the AVX-512 register sets. AVX-512 extends the width of each AVX SIMD register from 
256 bits to 512 bits. The 512-bit wide registers are known as the ZMM register set. AVX-512 conforming 
processors include 32 ZMM registers named ZMM0-ZMM31. The YMM and XMM register sets are aliased 
to the low-order 256 bits and 128 bits of each ZMM register, respectively. AVX-512 processors also include 
eight new opmask registers named K0-K7. These registers are primarily used as predicate masks to perform 
conditional executions and merging operations. They can also be employed as destination operands for 
instructions that generate vector mask results. You'll learn more about these registers later in this chapter. 
[Figure: the AVX-512 SIMD register set (ZMM0/YMM0/XMM0 through ZMM31/YMM31/XMM31, with the YMM and XMM registers occupying bit positions 255:0 and 127:0 of each ZMM register) and the AVX-512 opmask register set (bit positions 63:0)]

Figure 12-1. AVX-512 register sets


Data Types 


Similar to the YMM and XMM registers, software functions can use the ZMM registers to carry out SIMD 
operations using packed integer or packed floating-point operands. Table 12-2 shows the maximum number 
of elements that a ZMM register can hold for each supported data type. This table also shows the maximum 
number of elements that a YMM and XMM register can hold for comparison purposes. 
Table 12-2. Maximum Number of Elements for AVX-512 Register Operands

Data Type                          ZMM   YMM   XMM
Integer byte                        64    32    16
Integer word                        32    16     8
Integer doubleword                  16     8     4
Integer quadword                     8     4     2
Single-precision floating-point     16     8     4
Double-precision floating-point      8     4     2
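
Each entry in Table 12-2 is simply the register width divided by the element width. The following C++ sketch makes that relationship explicit; the function name max_elements is a hypothetical helper introduced here for illustration.

```cpp
#include <cstddef>

// Number of elements of width element_bits that fit in a register of width
// register_bits (both values are given in bits).
constexpr std::size_t max_elements(std::size_t register_bits, std::size_t element_bits)
{
    return register_bits / element_bits;
}

// Compile-time spot checks against the ZMM column of Table 12-2.
static_assert(max_elements(512, 8) == 64, "ZMM holds 64 bytes");
static_assert(max_elements(512, 64) == 8, "ZMM holds 8 quadwords or doubles");
```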


The alignment requirements for 512-bit wide operands in memory are similar to other x86 SIMD
operands. Except for instructions that explicitly specify an aligned operand (e.g., vmovdqa[32|64],
vmovap[d|s], etc.), proper alignment of a 512-bit wide operand in memory is not mandatory. However,
512-bit wide operands should always be aligned on a 64-byte boundary whenever possible to avoid processing
delays that can occur if the processor is forced to access an unaligned operand in memory. AVX-512
instructions that access 256-bit or 128-bit wide operands in memory should also ensure that these types of
operands are properly aligned on their respective natural boundaries.
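
On the C++ side of a mixed-language program, one way to obtain 64-byte alignment is the alignas specifier. The sketch below is an assumption-laden illustration rather than code from the book: the type ZmmBlock and the helper is_aligned are hypothetical names.

```cpp
#include <cstdint>

// Returns true if ptr lies on a boundary of 'alignment' bytes.
// 'alignment' is assumed to be a power of two.
bool is_aligned(const void* ptr, std::uintptr_t alignment)
{
    return (reinterpret_cast<std::uintptr_t>(ptr) & (alignment - 1)) == 0;
}

// A 512-bit (64-byte) block of 16 floats that the compiler places on a
// 64-byte boundary, which suits aligned 512-bit loads and stores.
struct alignas(64) ZmmBlock
{
    float values[16];
};
```

A function that receives a pointer from a caller can use a check like is_aligned to decide whether an aligned move is safe or whether an unaligned move is required.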


Instruction Syntax 


AVX-512 extends the instruction syntax of AVX and AVX2. Most AVX-512 instructions can use the same 
three-operand instruction syntax as AVX and AVX2 instructions, which consists of two non-destructive 
source operands and one destination operand. AVX-512 instructions can also exploit several new optional 
operands. These operands facilitate conditional executions and merging, embedded broadcast operations, 
and floating-point rounding control. The next few sections discuss AVX-512’s optional instruction operands 
in greater detail. 


Conditional Execution and Merging 


Most AVX-512 instructions support conditional execution and merging. A conditional execution and merge
operation uses the bits of an opmask register as a predicate mask to control instruction execution and
destination operand updates on a per-element basis. Figure 12-2 illustrates this concept in greater detail. In
this figure, registers ZMM0, ZMM1, and ZMM2 each contain 16 single-precision floating-point values. The
16 low-order bits of opmask register K1 constitute the predicate mask. When an opmask register is used in
this manner, each bit controls how the result for the corresponding element position in the destination
operand is calculated and updated.

Figure 12-2 also shows the outcome of three distinct executions of the vaddps instruction using the same
initial values. The first example instruction, vaddps zmm2,zmm0,zmm1, performs a packed single-precision
floating-point add of the elements in ZMM0 and ZMM1 and saves the resultant sums in register ZMM2.
Execution of this instruction is no different than an AVX vaddps instruction that uses XMM or YMM register
operands. The next example instruction, vaddps zmm2{k1},zmm0,zmm1, illustrates how the bits of opmask
register K1 are used to conditionally add and update the destination operand on a per-element basis. More
specifically, an element sum is calculated and saved in the destination operand only if the corresponding bit
position of the opmask register is set to one; otherwise, the destination operand element position remains
unchanged. This is called merge masking. The final example instruction in Figure 12-2, vaddps
zmm2{k1}{z},zmm0,zmm1, is similar to the previous instruction. The extra {z} operand instructs the processor to
perform zero masking instead of merge masking. Zero masking sets a destination operand element to zero if
its corresponding bit position in the opmask register is set to zero; otherwise, the sum is calculated and saved.
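
The difference between merge masking and zero masking can be modeled with a short scalar C++ sketch. It does not use AVX-512 instructions; the type Vec16f and the functions add_merge_masked and add_zero_masked are hypothetical stand-ins for a ZMM register and the two masked forms of vaddps.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec16f = std::array<float, 16>;   // stands in for a ZMM register of 16 floats

// Models vaddps dst{k},a,b (merge masking): an element whose mask bit is 0
// keeps the value that dst already held.
Vec16f add_merge_masked(Vec16f dst, const Vec16f& a, const Vec16f& b, std::uint16_t k)
{
    for (std::size_t i = 0; i < dst.size(); i++)
    {
        if ((k >> i) & 1)
            dst[i] = a[i] + b[i];
    }
    return dst;
}

// Models vaddps dst{k}{z},a,b (zero masking): an element whose mask bit is 0
// is set to 0.0 regardless of its previous value.
Vec16f add_zero_masked(const Vec16f& a, const Vec16f& b, std::uint16_t k)
{
    Vec16f dst{};                        // every element starts at 0.0f
    for (std::size_t i = 0; i < dst.size(); i++)
    {
        if ((k >> i) & 1)
            dst[i] = a[i] + b[i];
    }
    return dst;
}
```

With a mask of 0x0005, only elements 0 and 2 receive sums. The remaining elements keep their old values under merge masking and become 0.0 under zero masking.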




CHAPTER 12 = ADVANCED VECTOR EXTENSIONS 512 


[Figure: initial values of ZMM0, ZMM1, ZMM2, and K1 (bits 15:0), followed by the contents of ZMM2 after vaddps zmm2,zmm0,zmm1 (no masking), vaddps zmm2{k1},zmm0,zmm1 (merge masking), and vaddps zmm2{k1}{z},zmm0,zmm1 (zero masking)]

Figure 12-2. Execution examples of the vaddps instruction using no masking, merge masking, and zero masking


At this point a few words about the opmask registers are warranted. The eight opmask registers are 
somewhat like the general-purpose registers. On processors that support AVX-512, each opmask register 
is 64 bits wide. However, when employed as a predicate mask, only the low-order bits are used during
instruction execution. The exact number of used low-order bits varies depending on the number of 
vector elements. In Figure 12-2, bits 0-15 of opmask register K1 form the predicate mask since the vaddps 
instruction employs ZMM register operands that contain 16 single-precision floating-point values. 

AVX-512 includes several new instructions that can be used to read values from and write values to an 
opmask register and perform Boolean operations. You'll learn about these instructions later in this chapter. 
An opmask register can also be used as destination operand with instructions that generate a vector mask 
result such as vcmpp[d|s] and vpcmp[b|w|d|q]. The source code examples in Chapters 13 and 14 illustrate 
how to use these instructions with an opmask register. AVX-512 instructions can use opmask registers K1-K7 
as a predicate mask. Opmask register K0 cannot be employed as a predicate mask operand but it can be used
in any instruction that requires a source or destination operand opmask register. If an AVX-512 instruction
attempts to use K0 as a predicate mask, the processor substitutes an implicit operand of all 1s, which
disables all conditional execution and masking operations.
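
The two rules described above (only the low-order bits of an opmask register act as the predicate, and K0 behaves as a mask of all ones) can be modeled in a few lines of C++. The function names predicate_bits and select_predicate are hypothetical, and the array k is merely a stand-in for registers K0-K7.

```cpp
#include <cstdint>

// Returns the portion of a 64-bit opmask value that acts as the predicate mask
// for a vector of num_elements elements: the low-order num_elements bits.
std::uint64_t predicate_bits(std::uint64_t opmask, unsigned num_elements)
{
    if (num_elements >= 64)
        return opmask;
    return opmask & ((std::uint64_t{1} << num_elements) - 1);
}

// Models predicate selection. Specifying K0 as the predicate yields an
// implicit mask of all ones, which enables every element.
std::uint64_t select_predicate(unsigned k_index, const std::uint64_t (&k)[8],
                               unsigned num_elements)
{
    if (k_index == 0)
        return predicate_bits(~std::uint64_t{0}, num_elements);
    return predicate_bits(k[k_index], num_elements);
}
```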


Embedded Broadcast 


Many AVX-512 instructions can carry out a SIMD computation using an embedded broadcast operand. An 
embedded broadcast operand is a memory-based scalar value that is replicated N times into a temporary 
packed value, where N represents the number of vector elements referenced by the instruction. This 
temporary packed value is then used as an operand in a SIMD calculation. 

Figure 12-3 contains two example instruction sequences that illustrate broadcast operations. The first
example uses the vbroadcastss instruction to load the single-precision floating-point constant 2.0 into
each element position of ZMM1. The ensuing vmulps zmm2,zmm0,zmm1 instruction multiplies each value
in ZMM0 by 2.0 and saves the results to ZMM2. The second example instruction in Figure 12-3, vmulps
zmm2,zmm0,real4 bcst [rax], carries out this same operation using an embedded broadcast operand. The
text real4 bcst is a MASM directive that instructs the assembler to treat the memory location pointed to by
register RAX as an embedded broadcast operand.


[Figure: initial values of ZMM0 and the scalar 2.0 in memory at the address held in RAX. Example #1: vbroadcastss zmm1,real4 ptr [rax] followed by vmulps zmm2,zmm0,zmm1. Example #2: vmulps zmm2,zmm0,real4 bcst [rax]. Both examples produce the same values in ZMM2.]

Figure 12-3. Packed single-precision floating-point multiplication using the vbroadcastss and vmulps
instructions versus a vmulps instruction with an embedded broadcast operand


AVX-512 supports embedded broadcast operations using 32-bit and 64-bit wide elements. Embedded 
broadcasts cannot be performed using 8-bit and 16-bit wide elements. 
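
The equivalence between an explicit broadcast followed by a multiply and a multiply with an embedded broadcast operand can be modeled in scalar C++. The names Vec16f, broadcast, mul, and mul_bcst below are hypothetical; the code illustrates the semantics only and does not use AVX-512 instructions.

```cpp
#include <array>
#include <cstddef>

using Vec16f = std::array<float, 16>;   // stands in for a ZMM register of 16 floats

// Models vbroadcastss zmm,real4 ptr [mem]: copy one scalar into every element.
Vec16f broadcast(const float* scalar)
{
    Vec16f v;
    v.fill(*scalar);
    return v;
}

// Models vmulps zmm2,zmm0,zmm1: element-by-element multiplication.
Vec16f mul(const Vec16f& a, const Vec16f& b)
{
    Vec16f r;
    for (std::size_t i = 0; i < r.size(); i++)
        r[i] = a[i] * b[i];
    return r;
}

// Models vmulps zmm2,zmm0,real4 bcst [mem]: the scalar in memory is replicated
// implicitly, so no separate broadcast step or extra register is needed.
Vec16f mul_bcst(const Vec16f& a, const float* scalar)
{
    Vec16f r;
    for (std::size_t i = 0; i < r.size(); i++)
        r[i] = a[i] * (*scalar);
    return r;
}
```

Calling mul(a, broadcast(&two)) and mul_bcst(a, &two) produces identical results; the second form corresponds to the single-instruction sequence of Example #2.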
Instruction Level Rounding


The final AVX-512 instruction syntax enhancement involves instruction-level rounding control for
floating-point operations. In Chapter 5, you learned how to use the vldmxcsr and vstmxcsr instructions to change
the processor's global rounding mode for floating-point operations (see example Ch05_06). AVX-512 allows
some instructions to specify a floating-point rounding mode operand that overrides the current rounding
mode in MXCSR.RC. Table 12-3 shows the supported rounding mode operands, which are also called static
rounding modes. The -sae suffix that's appended to each static rounding mode operand string is an acronym
for suppress all exceptions. This suffix serves as a reminder that floating-point exceptions are always masked
whenever a static rounding mode operand is specified; MXCSR flag updates are also disabled.


Table 12-3. AVX-512 Instruction-Level Static Rounding Mode Operands 


Rounding Mode Operand    Description
{rn-sae}                 Round to nearest
{rd-sae}                 Round down (toward -∞)
{ru-sae}                 Round up (toward +∞)
{rz-sae}                 Round toward zero (truncate)


Static rounding mode operands can be used with many (but not all) AVX-512 instructions that perform
floating-point operations using 512-bit wide packed operands; 256-bit and 128-bit wide packed operands
are not supported. Static rounding mode operands can also be used with instructions that perform scalar
floating-point operations. In both use cases, all instruction operands must be registers. For example, the
instructions vmulps zmm2,zmm0,zmm1 {rz-sae} and vmulss xmm2,xmm0,xmm1 {rz-sae} are valid, whereas
vmulps zmm2,zmm0,zmmword ptr [rax] {rz-sae} and vmulss xmm2,xmm0,real4 ptr [rax] {rz-sae} are
invalid. Some AVX-512 floating-point instructions do not support the specification of a static rounding mode
operand, but these instructions still can use the operand {sae} to suppress all exceptions.
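
To make the four rounding directions of Table 12-3 concrete, the following C++ sketch applies each one when rounding a value to an integral value. It is an illustration of the rounding directions only: it does not issue AVX-512 instructions, the names Rounding and round_to_integral are hypothetical, and the Nearest case assumes the default floating-point environment (round to nearest, ties to even).

```cpp
#include <cmath>

// Mirrors the four static rounding mode operands: rn, rd, ru, and rz.
enum class Rounding { Nearest, Down, Up, TowardZero };

// Rounds x to an integral value in the requested direction.
double round_to_integral(double x, Rounding mode)
{
    switch (mode)
    {
        case Rounding::Down:        return std::floor(x);   // toward -infinity
        case Rounding::Up:          return std::ceil(x);    // toward +infinity
        case Rounding::TowardZero:  return std::trunc(x);   // truncate
        default:                    return std::nearbyint(x); // nearest (default FP environment assumed)
    }
}
```

For the value -2.5, the four directions yield -2.0 (nearest, ties to even), -3.0 (down), -2.0 (up), and -2.0 (toward zero).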


Instruction Set Overview 


This section presents an overview of the following AVX-512 instruction set extensions: AVX512F, AVX512CD, 
AVX512BW, and AVX512DQ. It also includes a summary of the opmask register instructions. The tables in 
this section only include instructions that are new to AVX-512. They do not include instructions that are a 
simple promotion of an existing AVX or AVX2 instruction. Most of the instructions in these tables can be 
used with 512-bit wide operands; 256-bit and 128-bit wide operands can be used on processors that support 
AVX512VL. 


AVX512F 


Table 12-4 lists the AVX512F instructions. As mentioned in the overview section of this chapter, all AVX-512 
conforming processors must minimally support the instructions that are included in this table. 




Table 12-4. AVX512F Instruction Set Overview

Mnemonic                    Description
valign[d|q]                 Align doubleword | quadword vectors
vblendmp[d|s]               Blend floating-point vectors using opmask control
vbroadcastf[32x4|64x4]      Broadcast floating-point tuples
vbroadcasti[32x4|64x4]      Broadcast integer tuples
vcompressp[d|s]             Store sparse packed floating-point values
vcvtp[d|s]2udq              Convert packed floating-point to packed unsigned doubleword integers
vcvts[d|s]2usi              Convert scalar floating-point to unsigned doubleword integer
vcvttp[d|s]2udq             Convert packed floating-point to packed unsigned doubleword integers with truncation
vcvtts[d|s]2usi             Convert scalar floating-point to unsigned doubleword integer with truncation
vcvtudq2p[d|s]              Convert packed unsigned doubleword integers to packed floating-point
vcvtusi2s[d|s]              Convert unsigned doubleword integer to floating-point
vexpandp[d|s]               Load sparse packed floating-point values
vextractf[32x4|64x4]        Extract packed floating-point values
vextracti[32x4|64x4]        Extract packed integer values
vfixupimmp[d|s]             Fix up special packed floating-point values
vfixupimms[d|s]             Fix up special scalar floating-point values
vgetexpp[d|s]               Convert exponents of packed floating-point values
vgetexps[d|s]               Convert exponents of scalar floating-point values
vgetmantp[d|s]              Get normalized mantissas from packed floating-point values
vgetmants[d|s]              Get normalized mantissas from scalar floating-point value
vinsertf[32x4|64x4]         Insert packed floating-point values
vinserti[32x4|64x4]         Insert packed integer values
vmovdqa[32|64]              Move aligned packed integers
vmovdqu[32|64]              Move unaligned packed integers
vpblendm[d|q]               Blend packed integers using opmask control
vpbroadcast[d|q]            Broadcast integer from general-purpose register
vpcmp[d|q]                  Compare packed signed integers
vpcmpu[d|q]                 Compare packed unsigned integers
vpcompress[d|q]             Store sparse packed integers
vpermi2[d|q|ps|pd]          Permute from two tables overwriting the index
vpermt2[d|q|ps|pd]          Permute from two tables overwriting one table
vpexpand[d|q]               Load sparse packed integers
vpmax[s|u]q                 Calculate packed quadword maximums
vpmin[s|u]q                 Calculate packed quadword minimums
vpmov[db|sdb|usdb]          Down convert packed doublewords to packed bytes
vpmov[dw|sdw|usdw]          Down convert packed doublewords to packed words
vpmov[qb|sqb|usqb]          Down convert packed quadwords to packed bytes
vpmov[qd|sqd|usqd]          Down convert packed quadwords to packed doublewords
vpmov[qw|sqw|usqw]          Down convert packed quadwords to packed words
vprol[d|q]                  Rotate left packed integers using constant count
vprolv[d|q]                 Rotate left packed integers using variable counts
vpror[d|q]                  Rotate right packed integers using constant count
vprorv[d|q]                 Rotate right packed integers using variable counts
vpscatterd[d|q]             Scatter packed integers using doubleword indices
vpscatterq[d|q]             Scatter packed integers using quadword indices
vpsraq                      Shift right arithmetic packed quadword integers using constant count
vpsravq                     Shift right arithmetic packed quadword integers using variable counts
vpternlog[d|q]              Bitwise ternary logic
vptestm[d|q]                Packed integer bitwise AND and set mask
vptestnm[d|q]               Packed integer bitwise NAND and set mask
vrcp14p[d|s]                Compute approximate reciprocals of packed floating-point values
vrcp14s[d|s]                Compute approximate reciprocals of scalar floating-point value
vreducep[d|s]               Perform reduction transformation on packed floating-point values
vreduces[d|s]               Perform reduction transformation on scalar floating-point value
vrndscalep[d|s]             Round packed floating-point values to number of fractional bits
vrndscales[d|s]             Round floating-point value to number of fractional bits
vrsqrt14p[d|s]              Compute approximate reciprocals of packed floating-point square roots
vrsqrt14s[d|s]              Compute approximate reciprocals of scalar floating-point square root
vscalefp[d|s]               Scale packed floating-point values
vscalefs[d|s]               Scale scalar floating-point value
vscatterdp[d|s]             Scatter packed floating-point values using doubleword indices
vscatterqp[d|s]             Scatter packed floating-point values using quadword indices
vshuff[32x4|64x2]           Shuffle packed floating-point values
vshufi[32x4|64x2]           Shuffle packed integer values
AVX512CD


Table 12-5 lists the AVX512CD instructions. These instructions are frequently used to detect and mitigate 
data dependencies that can occur when performing sparse array calculations or scatter operations. They can 
also be used with other AVX-512 instructions to perform ordinary computations. 


Table 12-5. AVX512CD Instruction Set Overview

Mnemonic                    Description
vpbroadcastm[b2q|w2d]       Broadcast mask to vector register
vpconflict[d|q]             Detect conflicts within packed integers
vplzcnt[d|q]                Count number of leading zeros in packed integers
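
The conflict detection idea can be sketched in scalar C++. The code below models the per-element results of vpconflictd and vplzcntd on doubleword values; it does not use the instructions themselves, and the function names conflict_bits and leading_zeros are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Models vpconflictd on 16 doublewords: bit j of result[i] is set when an
// earlier element j (j < i) holds the same value as element i.
std::array<std::uint32_t, 16> conflict_bits(const std::array<std::uint32_t, 16>& v)
{
    std::array<std::uint32_t, 16> result{};
    for (std::size_t i = 0; i < v.size(); i++)
    {
        for (std::size_t j = 0; j < i; j++)
        {
            if (v[j] == v[i])
                result[i] |= std::uint32_t{1} << j;
        }
    }
    return result;
}

// Models vplzcntd for one element: the number of leading zero bits,
// which is 32 when the element is zero.
unsigned leading_zeros(std::uint32_t x)
{
    unsigned n = 0;
    for (std::uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
        n++;
    return n;
}
```

If the 16 values are scatter indices, a nonzero entry in the result signals that the element writes to the same location as an earlier element, so a scatter loop that must accumulate into those locations needs to process such elements in separate passes.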


AVX512BW 


Table 12-6 lists the AVX512BW instructions. These instructions carry out their operations using packed byte
and word operands.


Table 12-6. AVX512BW Instruction Set Overview

Mnemonic                    Description
vdbpsadbw                   Double block packed sum-absolute-differences using unsigned bytes
vmovdq[u8|u16]              Move unaligned packed integers
vpblendm[b|w]               Blend packed integers using opmask control
vpbroadcast[b|w]            Broadcast integer from general-purpose register
vpcmp[b|w]                  Compare packed signed integers
vpcmpu[b|w]                 Compare packed unsigned integers
vpermw                      Permute packed words
vpermi2w                    Permute word integers from two tables overwriting the index
vpermt2w                    Permute word integers from two tables overwriting one table
vpmov[b|w]2m                Convert vector register to mask register
vpmovm2[b|w]                Convert mask register to vector register
vpmov[wb|swb|uswb]          Down convert packed words to packed bytes
vpsllvw                     Packed word shift left logical using variable bit counts
vpsravw                     Packed word shift right arithmetic using variable bit counts
vpsrlvw                     Packed word shift right logical using variable bit counts
vptestm[b|w]                Packed integer bitwise AND and set mask
vptestnm[b|w]               Packed integer bitwise NAND and set mask
AVX512DQ


Table 12-7 lists the AVX512DQ instructions. These instructions carry out their operations using packed
doubleword and quadword operands. AVX512DQ also includes instructions that perform conversions
between packed floating-point and integer quadwords.


Table 12-7. AVX512DQ Instruction Set Overview

Mnemonic                    Description
vcvtp[d|s]2qq               Convert packed floating-point to signed quadword integers
vcvtp[d|s]2uqq              Convert packed floating-point to unsigned quadword integers
vcvttp[d|s]2qq              Convert packed floating-point to signed quadword integers with truncation
vcvttp[d|s]2uqq             Convert packed floating-point to unsigned quadword integers with truncation
vcvtuqq2p[d|s]              Convert packed unsigned quadword integers to floating-point
vextractf64x2               Extract packed double-precision floating-point values
vextracti64x2               Extract packed quadword values
vfpclass[pd|ps]             Test packed floating-point class
vfpclass[sd|ss]             Test scalar floating-point class
vinsertf64x2                Insert packed double-precision floating-point values
vinserti64x2                Insert packed quadword values
vpmov[d|q]2m                Convert vector register to mask register
vpmovm2[d|q]                Convert mask register to vector register
vpmullq                     Multiply packed quadword integers and store low result
vrangep[d|s]                Range restriction calculation for packed floating-point
vranges[d|s]                Range restriction calculation for scalar floating-point
vreducep[d|s]               Perform reduction on packed floating-point values
vreduces[d|s]               Perform reduction on scalar floating-point values


Opmask Registers 


Table 12-8 lists the opmask register instructions. The word versions of these instructions require AVX512F 
except for kaddw and ktestw, which require AVX512DQ. The doubleword and quadword versions of the 
opmask register instructions require AVX512BW; the byte versions require AVX512DQ. 
Table 12-8. Opmask Register Instruction Set Overview

Mnemonic                    Description
kadd[b|w|d|q]               Add mask values
kand[b|w|d|q]               Bitwise AND
kandn[b|w|d|q]              Bitwise AND NOT
kmov[b|w|d|q]               Move value to/from opmask register
knot[b|w|d|q]               Bitwise NOT
kor[b|w|d|q]                Bitwise inclusive OR
kortest[b|w|d|q]            Bitwise inclusive OR; update RFLAGS.ZF and RFLAGS.CF
kshiftl[b|w|d|q]            Shift left
kshiftr[b|w|d|q]            Shift right
ktest[b|w|d|q]              Bitwise AND and ANDN; update RFLAGS.ZF and RFLAGS.CF
kunpck[bw|wd|dq]            Unpack
kxnor[b|w|d|q]              Bitwise exclusive NOR
kxor[b|w|d|q]               Bitwise exclusive OR
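
The word forms of a few opmask instructions can be modeled with ordinary integer operations. The sketch below is a scalar C++ illustration, not the instructions themselves; it uses a 64-bit integer as a stand-in for an opmask register, and the struct Flags is a hypothetical stand-in for RFLAGS.ZF and RFLAGS.CF. The functions are named after the instructions they model.

```cpp
#include <cstdint>

// Models knotw: invert the low-order 16 bits; bits 63:16 of the result are zero.
std::uint64_t knotw(std::uint64_t k)
{
    return ~k & 0xFFFFu;
}

// Models kandw: bitwise AND of the low-order 16 bits.
std::uint64_t kandw(std::uint64_t a, std::uint64_t b)
{
    return (a & b) & 0xFFFFu;
}

// Models kxnorw: bitwise exclusive NOR of the low-order 16 bits.
std::uint64_t kxnorw(std::uint64_t a, std::uint64_t b)
{
    return ~(a ^ b) & 0xFFFFu;
}

struct Flags
{
    bool zf;    // set when the OR result is all zeros
    bool cf;    // set when the OR result is all ones
};

// Models kortestw: OR the low-order 16 bits and report the result in ZF and CF.
Flags kortestw(std::uint64_t a, std::uint64_t b)
{
    const std::uint64_t r = (a | b) & 0xFFFFu;
    return { r == 0, r == 0xFFFFu };
}
```

Note that kxnorw applied to a register and itself produces 0xFFFF, which is a common way to create a mask that enables all 16 elements.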


Summary 


Here are the key learning points for Chapter 12: 


• All AVX-512 conforming processors support the AVX512F instruction set extension.
Inclusion of additional AVX-512 instruction set extensions varies depending on the
processor's target market.


• The AVX-512 register set includes 32 512-bit wide registers named ZMM0-ZMM31.
The low-order 256 and 128 bits are aliased to registers YMM0-YMM31 and
XMM0-XMM31, respectively.


• The AVX-512 register set also includes eight opmask registers named K0-K7. Opmask
registers K1-K7 can be used to perform instruction-level conditional executions with
merge masking or zero masking.


• Many AVX-512 instructions that require a packed operand of constant values can use
an embedded broadcast operand instead of a separate broadcast instruction.


• A static rounding mode operand can be specified with many AVX-512 instructions
that perform floating-point operations using 512-bit wide packed or scalar
floating-point register operands.
CHAPTER 13


AVX-512 Programming - Floating-Point


In previous chapters, you learned how to carry out scalar and packed floating-point operations using
the AVX and AVX2 instruction sets. In this chapter, you learn how to perform these operations using the
AVX-512 instruction set. The first part of this chapter contains source code examples that illustrate basic 
AVX-512 programming concepts using scalar floating-point operands. This includes examples that illustrate 
conditional executions, merge and zero masking, and instruction-level rounding. The second part of this 
chapter demonstrates how to use the AVX-512 instruction set to carry out packed floating-point calculations 
using 512-bit wide operands and the ZMM register set. 

The source code examples of this chapter require a processor and operating system that support 
AVX-512 and the following instruction set extensions: AVX512F, AVX512CD, AVX512BW, AVX512DQ, and 
AVX512VL. As discussed in Chapter 12, these extensions are supported by processors that are based on 
the Intel Skylake Server microarchitecture. Future processors from both AMD and Intel are also likely to 
incorporate the previously-mentioned instruction set extensions. You can use one of the freely available 
utilities listed in Appendix A to determine which AVX-512 instruction sets your system supports. 

In Chapter 16, you learn how to use the cpuid instruction to detect specific AVX-512 instruction set
extensions at runtime. 


Scalar Floating-Point 


AVX-512 extends the scalar floating-point capabilities of AVX to include merge masking, zero masking, 
and instruction-level rounding control. The source code examples of this section explain how to use these 
capabilities. They also exemplify some minor differences that you need to be aware of when writing scalar 
floating-point code using AVX-512 instructions.


Merge Masking 


Listing 13-1 shows the source code for example Ch13_01. This example describes how to perform merge
masking using AVX-512 scalar floating-point instructions. It also illustrates the use of several opmask register 
instructions. 
Listing 13-1. Example Ch13_01


#include "stdafx.h"
#include <string>
#include <iostream>
#include <iomanip>
#include <limits>
#define _USE_MATH_DEFINES
#include <math.h>

using namespace std;

extern "C" double g_PI = M_PI;

extern "C" bool Avx512CalcSphereAreaVol_(double* sa, double* vol, double radius, double error_val);

bool Avx512CalcSphereAreaVolCpp(double* sa, double* vol, double radius, double error_val)
{
    bool rc;

    if (radius < 0.0)
    {
        *sa = error_val;
        *vol = error_val;
        rc = false;
    }
    else
    {
        *sa = 4.0 * g_PI * radius * radius;
        *vol = *sa * radius / 3.0;
        rc = true;
    }

    return rc;
}

int main()
{
    const double error_val = numeric_limits<double>::quiet_NaN();
    const double radii[] = {-1.0, 0.0, 1.0, 2.0, 3.0, 4.0, -7.0, 10.0, -18.0, 20.0};
    int num_r = sizeof(radii) / sizeof(double);

    string sp(" ");
    string sep(75, '-');

    cout << setw(10) << "radius" << sp;
    cout << setw(6) << "rc1" << sp;
    cout << setw(6) << "rc2" << sp;
    cout << setw(10) << "sa1" << sp;
    cout << setw(10) << "sa2" << sp;
    cout << setw(10) << "vol1" << sp;
    cout << setw(10) << "vol2" << '\n';
    cout << sep << '\n';

    cout << fixed << setprecision(4);

    for (int i = 0; i < num_r; i++)
    {
        double sa1, sa2;
        double vol1, vol2;
        double r = radii[i];

        bool rc1 = Avx512CalcSphereAreaVolCpp(&sa1, &vol1, r, error_val);
        bool rc2 = Avx512CalcSphereAreaVol_(&sa2, &vol2, r, error_val);

        cout << setw(10) << r << sp;
        cout << setw(6) << boolalpha << rc1 << sp;
        cout << setw(6) << boolalpha << rc2 << sp;
        cout << setw(10) << sa1 << sp;
        cout << setw(10) << sa2 << sp;
        cout << setw(10) << vol1 << sp;
        cout << setw(10) << vol2 << '\n';
    }

    return 0;
}


        include <cmpequ.asmh>

            .const
r8_three    real8 3.0
r8_four     real8 4.0

            extern g_PI:real8

; extern "C" bool Avx512CalcSphereAreaVol_(double* sa, double* v, double r, double error_val);
;
; Returns:  false = invalid radius, true = valid radius

        .code
Avx512CalcSphereAreaVol_ proc

; Test radius for value >= 0.0
        vmovsd xmm0,xmm0,xmm2               ;xmm0 = radius
        vxorpd xmm5,xmm5,xmm5               ;xmm5 = 0.0
        vmovsd xmm16,xmm16,xmm3             ;xmm16 = error_val
        vcmpsd k1,xmm0,xmm5,CMP_GE          ;k1[0] = 1 if radius >= 0.0

; Calculate surface area and volume using mask from compare
        vmulsd xmm1{k1},xmm0,xmm0           ;xmm1 = r * r
        vmulsd xmm2{k1},xmm1,[r8_four]      ;xmm2 = 4 * r * r
        vmulsd xmm3{k1},xmm2,[g_PI]         ;xmm3 = 4 * PI * r * r (sa)
        vmulsd xmm4{k1},xmm3,xmm0           ;xmm4 = 4 * PI * r * r * r
        vdivsd xmm5{k1},xmm4,[r8_three]     ;xmm5 = 4 * PI * r * r * r / 3 (vol)

; Set surface area and volume to error_val if radius < 0.0 is true
        knotw k2,k1                         ;k2[0] = 1 if radius < 0.0
        vmovsd xmm3{k2},xmm3,xmm16          ;xmm3 = error_val if radius < 0.0
        vmovsd xmm5{k2},xmm5,xmm16          ;xmm5 = error_val if radius < 0.0

; Save results
        vmovsd real8 ptr [rcx],xmm3         ;save surface area
        vmovsd real8 ptr [rdx],xmm5         ;save volume
        kmovw eax,k1                        ;eax = return code
        ret

Avx512CalcSphereAreaVol_ endp
        end


The C++ code in Listing 13-1 starts with the function Avx512CalcSphereAreaVolCpp. This function
calculates the surface area and volume of any sphere whose radius is greater than or equal to zero. If the
sphere's radius is less than zero, Avx512CalcSphereAreaVolCpp sets the surface area and volume to
error_val. The remaining C++ code in Listing 13-1 performs test case initialization, exercises the functions
Avx512CalcSphereAreaVolCpp and Avx512CalcSphereAreaVol_, and streams results to cout.

The assembly language function Avx512CalcSphereAreaVol_ implements the same algorithm as its
C++ counterpart. This function begins with a vmovsd xmm0,xmm0,xmm2 instruction that copies argument
value r to register XMM0. It then loads register XMM5 with 0.0. The vmovsd xmm16,xmm16,xmm3 instruction
copies error_val into register XMM16. According to the Visual C++ calling convention, the new AVX-512
registers ZMM16-ZMM31 along with the low-order YMM and XMM counterparts are volatile across function
boundaries. This means that these registers can be used by any assembly language function without
preserving their values. The next instruction, vcmpsd k1,xmm0,xmm5,CMP_GE, sets opmask register bit K1[0]
to one if r is greater than or equal to zero; otherwise, this bit is set to zero.

The first instruction of the surface area and volume calculation code block, vmulsd
xmm1{k1},xmm0,xmm0, computes r * r if bit K1[0] is set to one (r >= 0.0 is true); it then saves the calculated
product in XMM1[63:0]. If bit K1[0] is set to zero (r < 0.0 is true), the processor skips the double-precision
floating-point multiplication and leaves register XMM1 unaltered. The next instruction,
vmulsd xmm2{k1},xmm1,[r8_four], computes 4.0 * r * r using the same merge masking operation as
the previous instruction. The ensuing vmulsd and vdivsd instructions complete the required surface area
(XMM3) and volume (XMM5) calculations. The merge masking operations in this code block exemplify
one of AVX-512's key computational capabilities: the processor carries out the double-precision
floating-point arithmetic calculations only if bit K1[0] is set to one; otherwise no calculations are performed, and the
respective destination operand registers remain unchanged.
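
The merge masking behavior described above can be modeled in ordinary C++. The following is only a minimal sketch, not code from the book: the names MergeMaskMul and CalcSphereAreaVolModel are hypothetical, a bool stands in for opmask bit K1[0], and the temporaries are initialized to 0.0 purely to keep the model well defined (in the assembly code, unwritten registers simply keep their previous contents).

```cpp
#include <cassert>

// Hypothetical model of a merge-masked scalar multiply such as
// vmulsd des{k1},a,b: des is overwritten only when the mask bit is set.
inline void MergeMaskMul(double& des, bool k, double a, double b)
{
    if (k)
        des = a * b;    // mask bit = 1: perform the calculation
                        // mask bit = 0: des keeps its old value
}

// Hypothetical model of the complete algorithm. Returns the mask bit
// (true when r >= 0.0), mirroring the assembly function's return code.
inline bool CalcSphereAreaVolModel(double r, double error_val, double* sa, double* vol)
{
    assert(sa != nullptr && vol != nullptr);

    const double pi = 3.14159265358979323846;
    const bool k1 = (r >= 0.0);             // models vcmpsd k1,xmm0,xmm5,CMP_GE
    double t1 = 0.0, t2 = 0.0, area = 0.0, t4 = 0.0, volume = 0.0;

    MergeMaskMul(t1, k1, r, r);             // r * r
    MergeMaskMul(t2, k1, t1, 4.0);          // 4 * r * r
    MergeMaskMul(area, k1, t2, pi);         // 4 * PI * r * r
    MergeMaskMul(t4, k1, area, r);          // 4 * PI * r * r * r
    if (k1)
        volume = t4 / 3.0;                  // merge-masked divide

    const bool k2 = !k1;                    // models knotw k2,k1
    if (k2)
    {
        area = error_val;                   // merge-masked moves of error_val
        volume = error_val;
    }

    *sa = area;
    *vol = volume;
    return k1;
}
```

Calling CalcSphereAreaVolModel with a negative radius leaves every product untouched and then overwrites both outputs with error_val, which is the same sequence of events that the opmask registers drive in the assembly language function.
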

Following the surface area and volume calculations, the knotw k2,k1 instruction negates the low-order 16 bits of
K1 and saves this result to K2[15:0]. This instruction also sets bits K2[63:16] to zero. Bit K2[0] is now set to
one if r < 0.0 is true. The knotw instruction is used here since it's part of the AVX512F instruction set extension;
knot[b|d|q] would also work here. The next instruction, vmovsd xmm3{k2},xmm3,xmm16, sets the surface
area to error_val if r < 0.0 is true. The subsequent vmovsd xmm5{k2},xmm5,xmm16 instruction performs the
same operation for the volume value. The final kmovw eax,k1 instruction loads EAX with the function return
code. Here are the results for source code example Ch13_01:

radius rc1 rc2 sa1 sa2 vol1 vol2
-1.0000 false false nan nan nan nan
0.0000 true true 0.0000 0.0000 0.0000 0.0000
1.0000 true true 12.5664 12.5664 4.1888 4.1888
2.0000 true true 50.2655 50.2655 33.5103 33.5103
3.0000 true true 113.0973 113.0973 113.0973 113.0973
4.0000 true true 201.0619 201.0619 268.0826 268.0826
-7.0000 false false nan nan nan nan
10.0000 true true 1256.6371 1256.6371 4188.7902 4188.7902
-18.0000 false false nan nan nan nan
20.0000 true true 5026.5482 5026.5482 33510.3216 33510.3216


Zero Masking 


The next source code example is named Ch13_02. This example demonstrates how to use zero masking to 
eliminate data-dependent conditional jumps from a calculation. Listing 13-2 shows the source code for this 
example. 


Listing 13-2. Example Ch13_02 


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <array>
#include <random>
#include <cmath>

using namespace std;

extern "C" bool Avx512CalcValues_(double* c, const double* a, const double* b, size_t n);

// Fill x with pseudo-random integer values in the range [-24, 175]
template<typename T> void Init(T* x, size_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {1, 200};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
        x[i] = (T)(ui_dist(rng) - 25);
}

bool Avx512CalcValuesCpp(double* c, const double* a, const double* b, size_t n)
{
    if (n == 0)
        return false;

    for (size_t i = 0; i < n; i++)
    {
        double val = a[i] * b[i];
        c[i] = (val >= 0.0) ? sqrt(val) : val * val;
    }

    return true;
}

int main()
{
    const size_t n = 20;
    array<double, n> a;
    array<double, n> b;
    array<double, n> c1;
    array<double, n> c2;

    Init<double>(a.data(), n, 13);
    Init<double>(b.data(), n, 23);

    bool rc1 = Avx512CalcValuesCpp(c1.data(), a.data(), b.data(), n);
    bool rc2 = Avx512CalcValues_(c2.data(), a.data(), b.data(), n);

    if (!rc1 || !rc2)
    {
        cout << "Invalid return code - ";
        cout << "rc1 = " << boolalpha << rc1 << " ";
        cout << "rc2 = " << boolalpha << rc2 << '\n';
    }
    else
    {
        cout << fixed << setprecision(4);

        for (size_t i = 0; i < n; i++)
        {
            cout << "i: " << setw(2) << i << " ";
            cout << "a: " << setw(9) << a[i] << " ";
            cout << "b: " << setw(9) << b[i] << " ";
            cout << "c1: " << setw(13) << c1[i] << " ";
            cout << "c2: " << setw(13) << c2[i] << "\n";
        }
    }
}

;-------------------------------------------------
;               Ch13_02.asm
;-------------------------------------------------


        include <cmpequ.asmh>

; extern "C" bool Avx512CalcValues_(double* c, const double* a, const double* b, size_t n);

        .code
Avx512CalcValues_ proc

; Validate n and initialize array index i
        xor eax,eax                         ;set error return code (also i = 0)
        test r9,r9                          ;is n == 0?
        jz Done                             ;jump if n is zero

        vxorpd xmm5,xmm5,xmm5               ;xmm5 = 0.0

; Load next a[i] and b[i], calculate val
@@:     vmovsd xmm0,real8 ptr [rdx+rax*8]   ;xmm0 = a[i]
        vmovsd xmm1,real8 ptr [r8+rax*8]    ;xmm1 = b[i]
        vmulsd xmm2,xmm0,xmm1               ;val = a[i] * b[i]

; Calculate c[i] = (val >= 0.0) ? sqrt(val) : val * val
        vcmpsd k1,xmm2,xmm5,CMP_GE          ;k1[0] = 1 if val >= 0.0
        vsqrtsd xmm3{k1}{z},xmm3,xmm2       ;xmm3 = (val >= 0.0) ? sqrt(val) : 0.0
        knotw k2,k1                         ;k2[0] = 1 if val < 0.0
        vmulsd xmm4{k2}{z},xmm2,xmm2        ;xmm4 = (val < 0.0) ? val * val : 0.0
        vorpd xmm0,xmm4,xmm3                ;xmm0 = (val >= 0.0) ? sqrt(val) : val * val
        vmovsd real8 ptr [rcx+rax*8],xmm0   ;save result to c[i]

; Update index i and repeat until done
        inc rax                             ;i += 1
        cmp rax,r9
        jl @B

        mov eax,1                           ;set success return code

Done:   ret
Avx512CalcValues_ endp
        end


In the C++ code, the function Avx512CalcValuesCpp performs a simple arithmetic calculation using
double-precision floating-point arrays. Each loop iteration begins by calculating the intermediate value val
= a[i] * b[i]. The next statement, c[i] = (val >= 0.0) ? sqrt(val) : val * val, loads c[i] with
a quantity that varies depending on the value of val. The assembly language function Avx512CalcValues_
performs the same computation. The C++ function main contains code that initializes the test arrays,
exercises the functions Avx512CalcValuesCpp and Avx512CalcValues_, and displays the results.

The processing loop of Avx512CalcValues_ begins with two vmovsd instructions that load a[i] and
b[i] into registers XMM0 and XMM1, respectively. The ensuing vmulsd xmm2,xmm0,xmm1 instruction
computes the intermediate product val = a[i] * b[i]. Following the calculation of val, the vcmpsd
k1,xmm2,xmm5,CMP_GE instruction compares val against 0.0 and sets bit K1[0] to one if val is greater than
or equal to zero; otherwise bit K1[0] is set to zero. The next instruction, vsqrtsd xmm3{k1}{z},xmm3,xmm2,
calculates the square root of val if K1[0] is set to one and saves the result in XMM3. If K1[0] is zero, the
processor skips the square root calculation and sets register XMM3 to 0.0.

The knotw k2,k1 instruction sets K2[0] to one if val is less than 0.0. The ensuing vmulsd
xmm4{k2}{z},xmm2,xmm2 instruction calculates and saves the product val * val in XMM4 if bit K2[0] is set to one;
otherwise XMM4 is set equal to 0.0. Following execution of the vmulsd instruction, either register XMM3 contains
sqrt(val) and XMM4 contains 0.0, or XMM3 contains 0.0 and XMM4 holds val * val. These register
values facilitate use of a vorpd xmm0,xmm4,xmm3 instruction to load XMM0 with the final value for c[i].
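
The zero masking and bitwise OR technique described in the preceding paragraphs can be sketched in portable C++. This is an illustrative model under stated assumptions rather than the book's code: the helpers ToBits, FromBits, ZeroMask, and CalcValueModel are hypothetical names, and unlike the processor this model evaluates both candidate expressions before discarding one of them.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helpers: view a double as its 64-bit pattern and back.
inline uint64_t ToBits(double x)
{
    uint64_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

inline double FromBits(uint64_t u)
{
    double x;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// Hypothetical model of zero masking: the destination receives the computed
// value when the mask bit is set and +0.0 (all bits zero) otherwise.
inline double ZeroMask(bool k, double computed)
{
    return k ? computed : 0.0;
}

// Models c[i] = (val >= 0.0) ? sqrt(val) : val * val using two masks.
inline double CalcValueModel(double a, double b)
{
    const double val = a * b;
    const bool k1 = (val >= 0.0);                      // vcmpsd k1,xmm2,xmm5,CMP_GE
    const bool k2 = !k1;                               // knotw k2,k1
    const double x3 = ZeroMask(k1, std::sqrt(val));    // vsqrtsd xmm3{k1}{z},...
    const double x4 = ZeroMask(k2, val * val);         // vmulsd xmm4{k2}{z},...

    // At least one operand has an all-zero bit pattern, so the bitwise OR
    // simply passes the other operand through unchanged.
    assert(ToBits(x3) == 0 || ToBits(x4) == 0);
    return FromBits(ToBits(x3) | ToBits(x4));          // vorpd xmm0,xmm4,xmm3
}
```

The OR works as a selector only because zero masking writes a bit pattern of all zeros; merge masking would leave stale data in the unselected register and the combined result would be meaningless.
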

Like the previous source code example, the function Avx512CalcValues_ demonstrates an important
capability of AVX-512. The use of zero masking and some simple Boolean logic allowed Avx512CalcValues_
to make logical decisions without any conditional jump instructions. This is noteworthy since data-dependent
conditional jump instructions are often slower than straight-line code. Here is the output for source code
example Ch13_02:


i: 0 a: -6.0000 b: 67.0000 c1: 161604.0000 c2: 161604.0000
i: 1 a: 128.0000 b: 22.0000 c1: 53.0660 c2: 53.0660
i: 2 a: 130.0000 b: -8.0000 c1: 1081600.0000 c2: 1081600.0000
i: 3 a: 152.0000 b: 73.0000 c1: 105.3376 c2: 105.3376
i: 4 a: 94.0000 b: 6.0000 c1: 23.7487 c2: 23.7487
i: 5 a: 2.0000 b: 88.0000 c1: 13.2665 c2: 13.2665
i: 6 a: 12.0000 b: 103.0000 c1: 35.1568 c2: 35.1568
i: 7 a: 105.0000 b: 117.0000 c1: 110.8377 c2: 110.8377
i: 8 a: 140.0000 b: -20.0000 c1: 7840000.0000 c2: 7840000.0000
i: 9 a: 74.0000 b: 3.0000 c1: 14.8997 c2: 14.8997
i: 10 a: 43.0000 b: -9.0000 c1: 149769.0000 c2: 149769.0000
i: 11 a: 2.0000 b: 122.0000 c1: 15.6205 c2: 15.6205
i: 12 a: 36.0000 b: 9.0000 c1: 18.0000 c2: 18.0000
i: 13 a: -18.0000 b: 123.0000 c1: 4901796.0000 c2: 4901796.0000
i: 14 a: 170.0000 b: 134.0000 c1: 150.9304 c2: 150.9304
i: 15 a: 102.0000 b: 3.0000 c1: 17.4929 c2: 17.4929
i: 16 a: 118.0000 b: -19.0000 c1: 5026564.0000 c2: 5026564.0000
i: 17 a: 85.0000 b: 148.0000 c1: 112.1606 c2: 112.1606
i: 18 a: 61.0000 b: 65.0000 c1: 62.9682 c2: 62.9682
i: 19 a: 18.0000 b: 74.0000 c1: 36.4966 c2: 36.4966


Instruction-Level Rounding 


The final source code example of this section, Ch13_03, explains how to use instruction-level rounding 
operands. It also illustrates use of the AVX-512 instructions that perform conversions between floating-point 
and unsigned integer values. Listing 13-3 shows the source code for example Ch13_03. 


Listing 13-3. Example Ch13_03 


#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <iomanip>
#define _USE_MATH_DEFINES
#include <math.h>

using namespace std;

extern "C" void Avx512CvtF32ToU32_(uint32_t val_cvt[4], float val);
extern "C" void Avx512CvtF64ToU64_(uint64_t val_cvt[4], double val);
extern "C" void Avx512CvtF64ToF32_(float val_cvt[4], double val);

void ConvertF32ToU32(void)
{
    uint32_t val_cvt[4];
    const float val[] {(float)M_PI, (float)M_SQRT2};
    const int num_vals = sizeof(val) / sizeof(float);

    cout << "\nConvertF32ToU32\n";

    for (int i = 0; i < num_vals; i++)
    {
        Avx512CvtF32ToU32_(val_cvt, val[i]);

        cout << "  Test case #" << i << " val = " << val[i] << '\n';
        cout << "    val_cvt[0] {rn-sae} = " << val_cvt[0] << '\n';
        cout << "    val_cvt[1] {rd-sae} = " << val_cvt[1] << '\n';
        cout << "    val_cvt[2] {ru-sae} = " << val_cvt[2] << '\n';
        cout << "    val_cvt[3] {rz-sae} = " << val_cvt[3] << '\n';
    }
}

void ConvertF64ToU64(void)
{
    uint64_t val_cvt[4];
    const double val[] {(float)M_PI, (float)M_SQRT2};
    const int num_vals = sizeof(val) / sizeof(double);

    cout << "\nConvertF64ToU64\n";

    for (int i = 0; i < num_vals; i++)
    {
        Avx512CvtF64ToU64_(val_cvt, val[i]);

        cout << "  Test case #" << i << " val = " << val[i] << '\n';
        cout << "    val_cvt[0] {rn-sae} = " << val_cvt[0] << '\n';
        cout << "    val_cvt[1] {rd-sae} = " << val_cvt[1] << '\n';
        cout << "    val_cvt[2] {ru-sae} = " << val_cvt[2] << '\n';
        cout << "    val_cvt[3] {rz-sae} = " << val_cvt[3] << '\n';
    }
}

void ConvertF64ToF32(void)
{
    float val_cvt[4];
    const double val[] {M_PI, -M_SQRT2};
    const int num_vals = sizeof(val) / sizeof(double);

    cout << "\nConvertF64ToF32\n";

    for (int i = 0; i < num_vals; i++)
    {
        Avx512CvtF64ToF32_(val_cvt, val[i]);

        cout << fixed << setprecision(7);
        cout << "  Test case #" << i << " val = " << val[i] << '\n';
        cout << "    val_cvt[0] {rn-sae} = " << val_cvt[0] << '\n';
        cout << "    val_cvt[1] {rd-sae} = " << val_cvt[1] << '\n';
        cout << "    val_cvt[2] {ru-sae} = " << val_cvt[2] << '\n';
        cout << "    val_cvt[3] {rz-sae} = " << val_cvt[3] << '\n';
    }
}

int main()
{
    ConvertF32ToU32();
    ConvertF64ToU64();
    ConvertF64ToF32();
    return 0;
}

;-------------------------------------------------
;               Ch13_03.asm
;-------------------------------------------------


; extern "C" void Avx512CvtF32ToU32_(uint32_t val_cvt[4], float val);

        .code
Avx512CvtF32ToU32_ proc
        vcvtss2usi eax,xmm1{rn-sae}         ;Convert using round to nearest
        mov dword ptr [rcx],eax

        vcvtss2usi eax,xmm1{rd-sae}         ;Convert using round down
        mov dword ptr [rcx+4],eax

        vcvtss2usi eax,xmm1{ru-sae}         ;Convert using round up
        mov dword ptr [rcx+8],eax

        vcvtss2usi eax,xmm1{rz-sae}         ;Convert using round to zero (truncate)
        mov dword ptr [rcx+12],eax
        ret
Avx512CvtF32ToU32_ endp

; extern "C" void Avx512CvtF64ToU64_(uint64_t val_cvt[4], double val);

Avx512CvtF64ToU64_ proc
        vcvtsd2usi rax,xmm1{rn-sae}
        mov qword ptr [rcx],rax

        vcvtsd2usi rax,xmm1{rd-sae}
        mov qword ptr [rcx+8],rax

        vcvtsd2usi rax,xmm1{ru-sae}
        mov qword ptr [rcx+16],rax

        vcvtsd2usi rax,xmm1{rz-sae}
        mov qword ptr [rcx+24],rax
        ret
Avx512CvtF64ToU64_ endp

; extern "C" void Avx512CvtF64ToF32_(float val_cvt[4], double val);

Avx512CvtF64ToF32_ proc
        vcvtsd2ss xmm2,xmm2,xmm1{rn-sae}
        vmovss real4 ptr [rcx],xmm2

        vcvtsd2ss xmm2,xmm2,xmm1{rd-sae}
        vmovss real4 ptr [rcx+4],xmm2

        vcvtsd2ss xmm2,xmm2,xmm1{ru-sae}
        vmovss real4 ptr [rcx+8],xmm2

        vcvtsd2ss xmm2,xmm2,xmm1{rz-sae}
        vmovss real4 ptr [rcx+12],xmm2
        ret
Avx512CvtF64ToF32_ endp
        end


The C++ code in Listing 13-3 begins with the function ConvertF32ToU32. This function performs test
case initialization and exercises the assembly language function Avx512CvtF32ToU32_, which converts a
single-precision floating-point value to an unsigned doubleword (32-bit) integer using different rounding
modes. The results are then streamed to cout. The C++ functions ConvertF64ToU64 and ConvertF64ToF32
carry out similar test case initializations for the assembly language functions Avx512CvtF64ToU64_ and
Avx512CvtF64ToF32_, respectively.

The first instruction of assembly language function Avx512CvtF32ToU32_, vcvtss2usi
eax,xmm1{rn-sae}, converts the scalar single-precision floating-point value in XMM1 (or val) to an unsigned
doubleword integer using the rounding mode round-to-nearest. As mentioned in Chapter 12, the -sae
suffix that's appended to the embedded rounding mode string is a reminder that floating-point exceptions
and MXCSR flag updates are always disabled when an instruction-level rounding control operand is
specified. The ensuing mov dword ptr [rcx],eax instruction saves the converted result in val_cvt[0].
Avx512CvtF32ToU32_ then employs additional vcvtss2usi instructions to carry out the same conversion
operation using rounding modes round-down, round-up, and round-to-zero. The organization of function
Avx512CvtF64ToU64_ is similar to Avx512CvtF32ToU32_ and uses the vcvtsd2usi instruction to convert a
double-precision floating-point value to an unsigned quadword integer. Note that both vcvtss2usi and
vcvtsd2usi are new AVX-512 instructions. AVX-512 also includes the instructions vcvtusi2s[d|s], which
perform unsigned integer to floating-point conversions. Neither AVX nor AVX2 includes instructions that
perform these types of conversions.
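
The effect of the four embedded rounding modes on a floating-point to unsigned integer conversion can be approximated with standard C++ library functions. The sketch below is a model only: the enumeration Rc and the function CvtF32ToU32Model are hypothetical names, the model accepts only non-negative values that fit in 32 bits, and it assumes the default floating-point environment, under which std::nearbyint rounds to nearest.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical names for the four embedded rounding modes.
enum class Rc { Nearest, Down, Up, Zero };

// Hypothetical model of vcvtss2usi with an embedded rounding operand.
inline uint32_t CvtF32ToU32Model(float val, Rc rc)
{
    assert(val >= 0.0f && val < 4294967296.0f);     // keep the model in range

    switch (rc)
    {
        case Rc::Nearest:
            return (uint32_t)std::nearbyint(val);   // {rn-sae}
        case Rc::Down:
            return (uint32_t)std::floor(val);       // {rd-sae}
        case Rc::Up:
            return (uint32_t)std::ceil(val);        // {ru-sae}
        default:
            return (uint32_t)std::trunc(val);       // {rz-sae}
    }
}
```

For non-negative inputs, round-down and round-to-zero always agree, which is why only the round-up results differ for the two test values used in this example.
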

The final assembly language function, Avx512CvtF64ToF32_, applies the vcvtsd2ss instruction to
convert a double-precision floating-point value to single-precision floating-point. The vcvtsd2ss instruction
is an existing AVX instruction that can be used with an instruction-level rounding control operand on
systems that support AVX-512. Here is the output for source code example Ch13_03:


ConvertF32ToU32
  Test case #0 val = 3.14159
    val_cvt[0] {rn-sae} = 3
    val_cvt[1] {rd-sae} = 3
    val_cvt[2] {ru-sae} = 4
    val_cvt[3] {rz-sae} = 3
  Test case #1 val = 1.41421
    val_cvt[0] {rn-sae} = 1
    val_cvt[1] {rd-sae} = 1
    val_cvt[2] {ru-sae} = 2
    val_cvt[3] {rz-sae} = 1

ConvertF64ToU64
  Test case #0 val = 3.14159
    val_cvt[0] {rn-sae} = 3
    val_cvt[1] {rd-sae} = 3
    val_cvt[2] {ru-sae} = 4
    val_cvt[3] {rz-sae} = 3
  Test case #1 val = 1.41421
    val_cvt[0] {rn-sae} = 1
    val_cvt[1] {rd-sae} = 1
    val_cvt[2] {ru-sae} = 2
    val_cvt[3] {rz-sae} = 1

ConvertF64ToF32
  Test case #0 val = 3.1415927
    val_cvt[0] {rn-sae} = 3.1415927
    val_cvt[1] {rd-sae} = 3.1415925
    val_cvt[2] {ru-sae} = 3.1415927
    val_cvt[3] {rz-sae} = 3.1415925
  Test case #1 val = -1.4142136
    val_cvt[0] {rn-sae} = -1.4142135
    val_cvt[1] {rd-sae} = -1.4142137
    val_cvt[2] {ru-sae} = -1.4142135
    val_cvt[3] {rz-sae} = -1.4142135
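
The ConvertF64ToF32 results can be reasoned about with a small C++ model of directed rounding. This is a sketch under stated assumptions, not the book's method: the function names CvtNearest, CvtDown, CvtUp, and CvtZero are hypothetical, the default round-to-nearest floating-point environment is assumed for the narrowing cast, and std::nextafter is used to step to the adjacent single-precision value when the nearest one lies on the wrong side of the input.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Round to nearest: an ordinary narrowing conversion under the default
// floating-point environment.
inline float CvtNearest(double v)
{
    assert(!std::isnan(v));
    return (float)v;
}

// Round down (toward -infinity): if the nearest float is above v, step
// to the next float below it.
inline float CvtDown(double v)
{
    float f = CvtNearest(v);
    if ((double)f > v)
        f = std::nextafter(f, -std::numeric_limits<float>::infinity());
    return f;
}

// Round up (toward +infinity): if the nearest float is below v, step
// to the next float above it.
inline float CvtUp(double v)
{
    float f = CvtNearest(v);
    if ((double)f < v)
        f = std::nextafter(f, std::numeric_limits<float>::infinity());
    return f;
}

// Round to zero: round down for non-negative values, up for negative ones.
inline float CvtZero(double v)
{
    return (v >= 0.0) ? CvtDown(v) : CvtUp(v);
}
```

For the positive test value, the nearest single-precision value lies above the input, so round-to-nearest and round-up agree while round-down and round-to-zero yield the next smaller value. For the negative test value, round-to-nearest, round-up, and round-to-zero coincide and only round-down differs, matching the pattern in the output above.
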


Packed Floating-Point 


The source code examples of this section illustrate how to use AVX-512 instructions to carry out 
computations using packed floating-point operands. The first three source code examples demonstrate 
basic operations with 512-bit wide packed floating-point operands including simple arithmetic, compare 
operations, and merge masking. The remaining examples focus on specific algorithms including vector cross 
product calculations, matrix-vector multiplications, and convolutions. 


Packed Floating-Point Arithmetic


Listing 13-4 shows the source code for example Ch13_04. This example demonstrates how to perform
common arithmetic operations using 512-bit wide single-precision and double-precision floating-point
operands. It also highlights some of the similarities between AVX/AVX2 and AVX-512 programming.


Listing 13-4. Example Ch13_04


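#pragma once
#include <string>
#include <cstdint>
#include <sstream>
#include <iomanip>

struct ZmmVal
{
public:
    union
    {
        int8_t m_I8[64];
        int16_t m_I16[32];
        int32_t m_I32[16];
        int64_t m_I64[8];
        uint8_t m_U8[64];
        uint16_t m_U16[32];
        uint32_t m_U32[16];
        uint64_t m_U64[8];
        float m_F32[16];
        double m_F64[8];
    };

    // String formatting member functions (ToStringF32, ToStringF64, and so on) are not shown
};

//------------------------------------------------
//               Ch13_04.cpp
//------------------------------------------------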


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#define _USE_MATH_DEFINES
#include <math.h>
#include "ZmmVal.h"

using namespace std;

extern "C" void Avx512PackedMathF32_(const ZmmVal* a, const ZmmVal* b, ZmmVal c[8]);
extern "C" void Avx512PackedMathF64_(const ZmmVal* a, const ZmmVal* b, ZmmVal c[8]);

void Avx512PackedMathF32(void)




{
    alignas(64) ZmmVal a;
    alignas(64) ZmmVal b;
    alignas(64) ZmmVal c[8];

    a.m_F32[0] = 36.0f;         b.m_F32[0] = -0.1111111f;
    a.m_F32[1] = 0.03125f;      b.m_F32[1] = 64.0f;
    a.m_F32[2] = 2.0f;          b.m_F32[2] = -0.0625f;
    a.m_F32[3] = 42.0f;         b.m_F32[3] = 8.666667f;
    a.m_F32[4] = 7.0f;          b.m_F32[4] = -18.125f;
    a.m_F32[5] = 20.5f;         b.m_F32[5] = 56.0f;
    a.m_F32[6] = 36.125f;       b.m_F32[6] = 24.0f;
    a.m_F32[7] = 0.5f;          b.m_F32[7] = -158.6f;
    a.m_F32[8] = 136.0f;        b.m_F32[8] = -9.1111111f;
    a.m_F32[9] = 2.03125f;      b.m_F32[9] = 864.0f;
    a.m_F32[10] = 32.0f;        b.m_F32[10] = -70.0625f;
    a.m_F32[11] = 442.0f;       b.m_F32[11] = 98.666667f;
    a.m_F32[12] = 57.0f;        b.m_F32[12] = -518.125f;
    a.m_F32[13] = 620.5f;       b.m_F32[13] = 456.0f;
    a.m_F32[14] = 736.125f;     b.m_F32[14] = 324.0f;
    a.m_F32[15] = 80.5f;        b.m_F32[15] = -298.6f;

    Avx512PackedMathF32_(&a, &b, c);

    cout << ("\nResults for Avx512PackedMathF32\n");

    for (int i = 0; i < 4; i++)
    {
        cout << "Group #" << i << '\n';
        cout << "  a:       " << a.ToStringF32(i) << '\n';
        cout << "  b:       " << b.ToStringF32(i) << '\n';
        cout << "  addps:   " << c[0].ToStringF32(i) << '\n';
        cout << "  subps:   " << c[1].ToStringF32(i) << '\n';
        cout << "  mulps:   " << c[2].ToStringF32(i) << '\n';
        cout << "  divps:   " << c[3].ToStringF32(i) << '\n';
        cout << "  absps:   " << c[4].ToStringF32(i) << '\n';
        cout << "  sqrtps:  " << c[5].ToStringF32(i) << '\n';
        cout << "  minps:   " << c[6].ToStringF32(i) << '\n';
        cout << "  maxps:   " << c[7].ToStringF32(i) << '\n';
        cout << '\n';
    }
}

void Avx512PackedMathF64(void)
{
    alignas(64) ZmmVal a;
    alignas(64) ZmmVal b;
    alignas(64) ZmmVal c[8];

    a.m_F64[0] = 2.0;           b.m_F64[0] = M_PI;
    a.m_F64[1] = 4.0;           b.m_F64[1] = M_E;
    a.m_F64[2] = 7.5;           b.m_F64[2] = -9.125;
    a.m_F64[3] = 3.0;           b.m_F64[3] = -M_PI;
    a.m_F64[4] = 12.0;          b.m_F64[4] = M_PI / 2;
    a.m_F64[5] = 24.0;          b.m_F64[5] = M_E / 2;
    a.m_F64[6] = 37.5;          b.m_F64[6] = -9.125 / 2;
    a.m_F64[7] = 43.0;          b.m_F64[7] = -M_PI / 2;

    Avx512PackedMathF64_(&a, &b, c);

    cout << ("\nResults for Avx512PackedMathF64\n");

    for (int i = 0; i < 4; i++)
    {
        cout << "Group #" << i << '\n';
        cout << "  a:       " << a.ToStringF64(i) << '\n';
        cout << "  b:       " << b.ToStringF64(i) << '\n';
        cout << "  addpd:   " << c[0].ToStringF64(i) << '\n';
        cout << "  subpd:   " << c[1].ToStringF64(i) << '\n';
        cout << "  mulpd:   " << c[2].ToStringF64(i) << '\n';
        cout << "  divpd:   " << c[3].ToStringF64(i) << '\n';
        cout << "  abspd:   " << c[4].ToStringF64(i) << '\n';
        cout << "  sqrtpd:  " << c[5].ToStringF64(i) << '\n';
        cout << "  minpd:   " << c[6].ToStringF64(i) << '\n';
        cout << "  maxpd:   " << c[7].ToStringF64(i) << '\n';
        cout << '\n';
    }
}

int main()
{
    Avx512PackedMathF32();
    Avx512PackedMathF64();
    return 0;
}

;-------------------------------------------------
;               Ch13_04.asm
;-------------------------------------------------

; Mask values used to calculate floating-point absolute values
ConstVals   segment readonly align(64) 'const'
AbsMaskF32  dword 16 dup(7fffffffh)
AbsMaskF64  qword 8 dup(7fffffffffffffffh)
ConstVals   ends

; extern "C" void Avx512PackedMathF32_(const ZmmVal* a, const ZmmVal* b, ZmmVal c[8]);

        .code
Avx512PackedMathF32_ proc

; Load packed SP floating-point values
        vmovaps zmm0,zmmword ptr [rcx]      ;zmm0 = *a
        vmovaps zmm1,zmmword ptr [rdx]      ;zmm1 = *b

; Packed SP floating-point addition
        vaddps zmm2,zmm0,zmm1
        vmovaps zmmword ptr [r8+0],zmm2

; Packed SP floating-point subtraction
        vsubps zmm2,zmm0,zmm1
        vmovaps zmmword ptr [r8+64],zmm2

; Packed SP floating-point multiplication
        vmulps zmm2,zmm0,zmm1
        vmovaps zmmword ptr [r8+128],zmm2

; Packed SP floating-point division
        vdivps zmm2,zmm0,zmm1
        vmovaps zmmword ptr [r8+192],zmm2

; Packed SP floating-point absolute value (b)
        vandps zmm2,zmm1,zmmword ptr [AbsMaskF32]
        vmovaps zmmword ptr [r8+256],zmm2

; Packed SP floating-point square root (a)
        vsqrtps zmm2,zmm0
        vmovaps zmmword ptr [r8+320],zmm2

; Packed SP floating-point minimum
        vminps zmm2,zmm0,zmm1
        vmovaps zmmword ptr [r8+384],zmm2

; Packed SP floating-point maximum
        vmaxps zmm2,zmm0,zmm1
        vmovaps zmmword ptr [r8+448],zmm2

        vzeroupper
        ret
Avx512PackedMathF32_ endp

; extern "C" void Avx512PackedMathF64_(const ZmmVal* a, const ZmmVal* b, ZmmVal c[8]);

Avx512PackedMathF64_ proc

; Load packed DP floating-point values
        vmovapd zmm0,zmmword ptr [rcx]      ;zmm0 = *a
        vmovapd zmm1,zmmword ptr [rdx]      ;zmm1 = *b

; Packed DP floating-point addition
        vaddpd zmm2,zmm0,zmm1
        vmovapd zmmword ptr [r8+0],zmm2

; Packed DP floating-point subtraction
        vsubpd zmm2,zmm0,zmm1
        vmovapd zmmword ptr [r8+64],zmm2

; Packed DP floating-point multiplication
        vmulpd zmm2,zmm0,zmm1
        vmovapd zmmword ptr [r8+128],zmm2

; Packed DP floating-point division
        vdivpd zmm2,zmm0,zmm1
        vmovapd zmmword ptr [r8+192],zmm2

; Packed DP floating-point absolute value (b)
        vandpd zmm2,zmm1,zmmword ptr [AbsMaskF64]
        vmovapd zmmword ptr [r8+256],zmm2

; Packed DP floating-point square root (a)
        vsqrtpd zmm2,zmm0
        vmovapd zmmword ptr [r8+320],zmm2

; Packed DP floating-point minimum
        vminpd zmm2,zmm0,zmm1
        vmovapd zmmword ptr [r8+384],zmm2

; Packed DP floating-point maximum
        vmaxpd zmm2,zmm0,zmm1
        vmovapd zmmword ptr [r8+448],zmm2

        vzeroupper
        ret
Avx512PackedMathF64_ endp
        end


Listing 13-4 starts with the declaration of the C++ structure ZmmVal, which is declared in the header file 
ZmmVal.h. This structure is analogous to the XmmVal and YmmVal structures that were used by the source code 
examples in Chapters 6 and 9. The structure ZmmVal contains a publicly-accessible anonymous union that 
simplifies packed operand data exchange between functions written in C++ and the x86 assembly language. 
The members of this union correspond to the packed data types that can be used with a ZMM register. The 
structure ZmmVal also includes several string formatting functions for display purposes (the source code for 
these member functions is not shown). 

The remaining C++ code in Listing 13-4 is similar to the code that was used in example Ch09_01. The
declarations for assembly language functions Avx512PackedMathF32_ and Avx512PackedMathF64_ follow
the declaration of structure ZmmVal. These functions carry out various packed single-precision and
double-precision floating-point arithmetic operations using the supplied ZmmVal arguments. The C++ functions
Avx512PackedMathF32 and Avx512PackedMathF64 perform ZmmVal variable initializations, invoke the
assembly language calculating functions, and display results. Note that the alignas(64) specifier is used
with each ZmmVal variable definition.
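
The role of the alignas(64) specifier can be checked directly in C++. The following is a minimal sketch, assuming a hypothetical stand-in structure named Vec512Model (not the book's ZmmVal) and a hypothetical helper IsAligned64; it tests whether an address lies on the 64-byte boundary that aligned 512-bit moves such as vmovaps and vmovapd require of a memory operand.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ZmmVal: 64 bytes of single-precision data.
struct Vec512Model
{
    float m_F32[16];
};

static_assert(sizeof(Vec512Model) == 64, "model must span one 512-bit operand");

// Returns true if p lies on a 64-byte boundary.
inline bool IsAligned64(const void* p)
{
    return (reinterpret_cast<uintptr_t>(p) % 64) == 0;
}
```

Because the structure is exactly 64 bytes long, aligning the first element of an array such as c[8] also aligns every subsequent element, which is why the assembly code can store to offsets 0, 64, 128, and so on with aligned move instructions.
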

The assembly language code in Listing 13-4 begins with a 64-byte aligned custom memory segment 
named ConstVals. This segment contains definitions for the packed constant values that are used in the 
calculating functions. A custom segment is used here since the MASM align directive does not support 
aligning data items on a 64-byte boundary. Chapter 9 contains additional information about custom 
memory segments. The segment ConstVals contains the constants AbsMaskF32 and AbsMaskF64, which are 
used to calculate absolute values for 512-bit wide packed single-precision and double-precision floating- 
point values. 
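
The absolute value masks work because the sign of an IEEE 754 floating-point value is stored in its most significant bit. As a rough C++ illustration (the function names AbsViaMask and AbsViaMask16 are hypothetical and not part of the example), the following sketch clears that bit with the same 7fffffffh pattern that is stored in each element of AbsMaskF32.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits wide");

// Clears the sign bit (bit 31) of a single-precision value by ANDing its
// bit pattern with 0x7fffffff.
inline float AbsViaMask(float x)
{
    uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u &= 0x7fffffffu;
    std::memcpy(&x, &u, sizeof x);
    return x;
}

// Applies the mask to all 16 lanes, modeling vandps with AbsMaskF32.
inline void AbsViaMask16(float des[16], const float src[16])
{
    for (int i = 0; i < 16; i++)
        des[i] = AbsViaMask(src[i]);
}
```

The double-precision case is identical in principle; it uses a 64-bit pattern with only bit 63 cleared, which corresponds to the value stored in each element of AbsMaskF64.
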

The first instruction of Avx512PackedMathF32_, vmovaps zmm0,zmmword ptr [rcx], loads argument a
(the 16 floating-point values in ZmmVal a) into register ZMM0. The vmovaps instruction can be used here since ZmmVal
a was defined using the alignas(64) specifier. The operator zmmword ptr directs the assembler to treat the
memory location pointed to by RCX as a 512-bit wide operand. Like the operators xmmword ptr and ymmword
ptr, the zmmword ptr operator is often used to improve code readability even when it's not explicitly
required. The ensuing vmovaps zmm1,zmmword ptr [rdx] instruction loads ZmmVal b into register ZMM1.
The vaddps zmm2,zmm0,zmm1 instruction that follows sums the packed single-precision floating-point values
in ZMM0 and ZMM1 and saves the result in ZMM2. The vmovaps zmmword ptr [r8+0],zmm2 instruction saves
the packed sums to c[0].

The ensuing vsubps, vmulps, and vdivps instructions carry out packed single-precision
floating-point subtraction, multiplication, and division. This is followed by a vandps zmm2,zmm1,zmmword ptr
[AbsMaskF32] instruction that calculates packed absolute values using argument b. The remaining
instructions in Avx512PackedMathF32_ calculate packed single-precision floating-point square roots,
minimums, and maximums.

Prior to its ret instruction, the function Avx512PackedMathF32_ uses a vzeroupper instruction, which zeros
the high-order 384 bits of registers ZMM0-ZMM15. As explained in Chapter 4, a vzeroupper instruction is
used here to avoid potential performance delays that can occur whenever the processor transitions from
executing x86-AVX code to executing x86-SSE code. Any assembly language function that uses one or more
YMM or ZMM registers and is callable from code that potentially uses x86-SSE instructions should ensure
that a vzeroupper instruction is executed before program control is transferred back to the calling function.
It should be noted that according to the Intel 64 and IA-32 Architectures Optimization Reference Manual,
the vzeroupper use recommendations apply to functions that employ x86-AVX instructions with registers
ZMM0-ZMM15 or YMM0-YMM15. Functions that only exploit registers ZMM16-ZMM31 or YMM16-YMM31
do not need to observe the vzeroupper use recommendations.

The organization of function Avx512PackedMathF64_ is similar to Avx512PackedMathF32_.
Avx512PackedMathF64_ carries out its calculations using the double-precision versions of the same AVX-512
instructions that are used in Avx512PackedMathF32_. Here is the output for source code example Ch13_04:


Results for Avx512PackedMathF32

Group #0
a: 36.000000 0.031250 | 2.000000 42.000000
b: -0.111111 64.000000 | -0.062500 8.666667
addps: 35.888889 64.031250 | 1.937500 50.666668
subps: 36.111111 -63.968750 | 2.062500 33.333332
mulps: -4.000000 2.000000 | -0.125000 364.000000
divps: -324.000031 0.000488 | -32.000000 4.846154
absps: 0.111111 64.000000 | 0.062500 8.666667
sqrtps: 6.000000 0.176777 | 1.414214 6.480741
minps: -0.111111 0.031250 | -0.062500 8.666667
maxps: 36.000000 64.000000 | 2.000000 42.000000

Group #1
a: 7.000000 20.500000 | 36.125000 0.500000
b: -18.125000 56.000000 | 24.000000 -158.600006
addps: -11.125000 76.500000 | 60.125000 -158.100006
subps: 25.125000 -35.500000 | 12.125000 159.100006
mulps: -126.875000 1148.000000 | 867.000000 -79.300003
divps: -0.386207 0.366071 | 1.505208 -0.003153
absps: 18.125000 56.000000 | 24.000000 158.600006
sqrtps: 2.645751 4.527693 | 6.010407 0.707107
minps: -18.125000 20.500000 | 24.000000 -158.600006
maxps: 7.000000 56.000000 | 36.125000 0.500000

Group #2
a: 136.000000 2.031250 | 32.000000 442.000000
b: -9.111111 864.000000 | -70.062500 98.666664
addps: 126.888885 866.031250 | -38.062500 540.666687
subps: 145.111115 -861.968750 | 102.062500 343.333344
mulps: -1239.111084 1755.000000 | -2242.000000 43610.664063
divps: -14.926830 0.002351 | -0.456735 4.479730
absps: 9.111111 864.000000 | 70.062500 98.666664
sqrtps: 11.661903 1.425219 | 5.656854 21.023796
minps: -9.111111 2.031250 | -70.062500 98.666664
maxps: 136.000000 864.000000 | 32.000000 442.000000

Group #3
a: 57.000000 620.500000 | 736.125000 80.500000
b: -518.125000 456.000000 | 324.000000 -298.600006
addps: -461.125000 1076.500000 | 1060.125000 -218.100006
subps: 575.125000 164.500000 | 412.125000 379.100006
mulps: -29533.125000 282948.000000 | 238504.500000 -24037.300781
divps: -0.110012 1.360746 | 2.271991 -0.269591
absps: 518.125000 456.000000 | 324.000000 298.600006
sqrtps: 7.549834 24.909838 | 27.131624 8.972179
minps: -518.125000 456.000000 | 324.000000 -298.600006
maxps: 57.000000 620.500000 | 736.125000 80.500000

Results for Avx512PackedMathF64

Group #0 
a: 2.000000000000 4.000000000000 
b: 3.141592653590 2.718281828459 
addpd: 5.141592653590 6.718281828459 
subpd: -1.141592653590 1.281718171541 
mulpd: 6.283185307180 10.873127313836 
divpd: 0.636619772368 1.471517764686 
abspd: 3.141592653590 2.718281828459 
sqrtpd: 1.414213562373 2.000000000000
minpd: 2.000000000000 2.718281828459 
maxpd: 3.141592653590 4.000000000000 
Group #1 
a: 7.500000000000 3.000000000000
b: -9.125000000000 -3.141592653590 
addpd: -1.625000000000 -0.141592653590 
subpd: 16.625000000000 6.141592653590 
mulpd: -68.437500000000 -9.424777960769 
divpd: -0.821917808219 -0.954929658551 
abspd: 9.125000000000 3.141592653590 
sqrtpd: 2.738612787526 1.732050807569 
minpd: -9.125000000000 -3.141592653590 
maxpd: 7.500000000000 3.000000000000 
Group #2 
a: 12.000000000000 24.000000000000 
b: 1.570796326795 1.359140914230 
addpd: 13.570796326795 25.359140914230 
subpd: 10.429203673205 22.640859085770 
mulpd: 18.849555921539 32.619381941509 
divpd: 7.639437268411 17.658213176229 
abspd: 1.570796326795 1.359140914230 
sqrtpd: 3.464101615138 4.898979485566 
minpd: 1.570796326795 1.359140914230 
maxpd: 12.000000000000 24.000000000000 
Group #3 
a: 37.500000000000 43.000000000000 
b: -4.562500000000 -1.570796326795 
addpd: 32.937500000000 41.429203673205 
subpd: 42.062500000000 44.570796326795 
mulpd: -171.093750000000 -67.544242052181 
divpd: -8.219178082192 -27.374650211806 
abspd: 4.562500000000 1.570796326795 
sqrtpd: 6.123724356958 6.557438524302 
minpd: -4.562500000000 -1.570796326795 
maxpd: 37.500000000000 43.000000000000 


Packed Floating-Point Compares 


In Chapter 6 you learned how to use the vcmpp[s|d] instructions to perform packed single-precision and
double-precision floating-point compare operations (see source code example Ch06_02). Recall that the AVX
versions of these instructions set the elements of a SIMD operand to all zeros or all ones to indicate the result
of a compare operation. In this section, you learn how to use the AVX-512 version of the vcmpps instruction,
which saves its compare result in an opmask register. Listing 13-5 shows the C++ and assembly language
code for example Ch13_05.
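
Before looking at the listing, it may help to see the difference between an element-wise compare result and a mask result expressed in plain C++. The sketch below is only a model with hypothetical function names (CmpLtMask16 and ExpandMask16): the first function packs the outcome of 16 less-than compares into a 16-bit value, one bit per element, and the second expands such a value back into one 0 or 1 per element for display.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 16-element "less than" compare whose result is a
// bit mask: bit i is set when a[i] < b[i].
inline uint16_t CmpLtMask16(const float a[16], const float b[16])
{
    uint16_t k = 0;

    for (size_t i = 0; i < 16; i++)
    {
        if (a[i] < b[i])                    // false if either operand is NaN
            k |= (uint16_t)(1u << i);
    }

    return k;
}

// Expands each mask bit into a 0 or 1 value, one per element.
inline void ExpandMask16(uint32_t des[16], uint16_t k)
{
    for (size_t j = 0; j < 16; j++)
        des[j] = (k & (1u << j)) ? 1 : 0;
}
```

A 16-bit mask is sufficient here because a 512-bit operand holds exactly 16 single-precision elements.
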

Listing 13-5. Example Ch13_05


#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <iomanip>
#include <limits>
#include "ZmmVal.h"

using namespace std;

extern "C" void Avx512PackedCompareF32_(const ZmmVal* a, const ZmmVal* b, uint16_t c[8]);

const char* c_CmpStr[8] =
{
    "EQ", "NE", "LT", "LE", "GT", "GE", "ORDERED", "UNORDERED"
};

// Expand each 16-bit compare mask into sixteen 0/1 doubleword values for display
void ToZmmVal(ZmmVal des[8], uint16_t src[8])
{
    for (size_t i = 0; i < 8; i++)
    {
        uint16_t val_src = src[i];

        for (size_t j = 0; j < 16; j++)
            des[i].m_U32[j] = val_src & (1 << j) ? 1 : 0;
    }
}

void Avx512PackedCompareF32(void)
{
    alignas(64) ZmmVal a;
    alignas(64) ZmmVal b;
    uint16_t c[8];

    a.m_F32[0] = 2.0;        b.m_F32[0] = 1.0;
    a.m_F32[1] = 7.0;        b.m_F32[1] = 12.0;
    a.m_F32[2] = -6.0;       b.m_F32[2] = -6.0;
    a.m_F32[3] = 3.0;        b.m_F32[3] = 8.0;
    a.m_F32[4] = -2.0;       b.m_F32[4] = 1.0;
    a.m_F32[5] = 17.0;       b.m_F32[5] = 17.0;
    a.m_F32[6] = 6.5;        b.m_F32[6] = -9.125;
    a.m_F32[7] = 4.875;      b.m_F32[7] = numeric_limits<float>::quiet_NaN();
    a.m_F32[8] = 2.0;        b.m_F32[8] = 101.0;
    a.m_F32[9] = 7.0;        b.m_F32[9] = -312.0;
    a.m_F32[10] = -5.0;      b.m_F32[10] = 15.0;
    a.m_F32[11] = -33.0;     b.m_F32[11] = -33.0;
    a.m_F32[12] = -12.0;     b.m_F32[12] = 198.0;
    a.m_F32[13] = 107.0;     b.m_F32[13] = 107.0;
    a.m_F32[14] = 16.125;    b.m_F32[14] = -2.75;
    a.m_F32[15] = 42.875;    b.m_F32[15] = numeric_limits<float>::quiet_NaN();

    Avx512PackedCompareF32_(&a, &b, c);

    cout << "\nResults for Avx512PackedCompareF32\n";

    ZmmVal c_display[8];
    ToZmmVal(c_display, c);

    for (int sel = 0; sel < 4; sel++)
    {
        cout << setw(12) << "a[" << sel << "]:" << a.ToStringF32(sel) << '\n';
        cout << setw(12) << "b[" << sel << "]:" << b.ToStringF32(sel) << '\n';
        cout << '\n';

        for (int j = 0; j < 8; j++)
            cout << setw(14) << c_CmpStr[j] << ':' << c_display[j].ToStringU32(sel) << '\n';
        cout << '\n';
    }
}

int main()
{
    Avx512PackedCompareF32();
    return 0;
}
;-------------------------------------------------
;               Ch13_05.asm
;-------------------------------------------------

        include <cmpequ.asmh>

; extern "C" void Avx512PackedCompareF32_(const ZmmVal* a, const ZmmVal* b, uint16_t c[8]);

        .code
Avx512PackedCompareF32_ proc
        vmovaps zmm0,[rcx]                  ;zmm0 = a
        vmovaps zmm1,[rdx]                  ;zmm1 = b

; Perform packed EQUAL compare
        vcmpps k1,zmm0,zmm1,CMP_EQ
        kmovw word ptr [r8],k1

; Perform packed NOT EQUAL compare
        vcmpps k1,zmm0,zmm1,CMP_NEQ
        kmovw word ptr [r8+2],k1

; Perform packed LESS THAN compare
        vcmpps k1,zmm0,zmm1,CMP_LT
        kmovw word ptr [r8+4],k1

; Perform packed LESS THAN OR EQUAL compare
        vcmpps k1,zmm0,zmm1,CMP_LE
        kmovw word ptr [r8+6],k1

; Perform packed GREATER THAN compare
        vcmpps k1,zmm0,zmm1,CMP_GT
        kmovw word ptr [r8+8],k1

; Perform packed GREATER THAN OR EQUAL compare
        vcmpps k1,zmm0,zmm1,CMP_GE
        kmovw word ptr [r8+10],k1

; Perform packed ORDERED compare
        vcmpps k1,zmm0,zmm1,CMP_ORD
        kmovw word ptr [r8+12],k1

; Perform packed UNORDERED compare
        vcmpps k1,zmm0,zmm1,CMP_UNORD
        kmovw word ptr [r8+14],k1

        vzeroupper
        ret
Avx512PackedCompareF32_ endp
        end


The C++ function Avx512PackedCompareF32 that's shown in Listing 13-5 starts its execution by loading 
test values into the single-precision floating-point elements of ZmmVal variables a and b. Note that these 
variables are defined using the C++ alignas(64) specifier. Following variable initialization, the function 
Avx512PackedCompareF32 invokes the assembly language function Avx512PackedCompareF32_ to perform 
the packed compares. It then streams the results to cout. 

The assembly language function Avx512PackedCompareF32_ begins its execution with two vmovaps 
instructions that load ZmmVal variables a and b into registers ZMM0 and ZMM1, respectively. The ensuing 
vcmpps k1,zmm0,zmm1,CMP_EQ instruction compares the single-precision floating-point elements in registers 
ZMM0 and ZMM1 for equality. For each element position, this instruction sets the corresponding bit 
position in opmask register K1 to one if the values in ZMM0 and ZMM1 are equal; otherwise, the opmask 
register bit is set to zero. Figure 13-1 illustrates this operation in greater detail. The kmovw word ptr [r8],k1 
instruction that follows saves the resultant mask to c[0]. 
[Figure 13-1: initial values of ZMM0 and ZMM1, and the resulting mask in K1 after execution of vcmpps k1,zmm0,zmm1,CMP_EQ] 

Figure 13-1. Example execution of the vcmpps k1,zmm0,zmm1,CMP_EQ instruction 


The remaining code in Avx512PackedCompareF32_ performs additional compare operations using the 
vcmpps instruction, ZmmVal variables a and b, and common compare predicates. Note that like the previous 
example, Avx512PackedCompareF32_ uses a vzeroupper instruction prior to its ret instruction. Here are the 
results for source code example Ch13_05. 


Results for Avx512PackedCompareF32 

        a[0]:     2.000000     7.000000    -6.000000     3.000000 
        b[0]:     1.000000    12.000000    -6.000000     8.000000 

          EQ:            0            0            1            0 
          NE:            1            1            0            1 
          LT:            0            1            0            1 
          LE:            0            1            1            1 
          GT:            1            0            0            0 
          GE:            1            0            1            0 
     ORDERED:            1            1            1            1 
   UNORDERED:            0            0            0            0 

        a[1]:    -2.000000    17.000000     6.500000     4.875000 
        b[1]:     1.000000    17.000000    -9.125000          nan 

          EQ:            0            1            0            0 
          NE:            1            0            1            1 
          LT:            1            0            0            0 
          LE:            1            1            0            0 
          GT:            0            0            1            0 
          GE:            0            1            1            0 
     ORDERED:            1            1            1            0 
   UNORDERED:            0            0            0            1 

        a[2]:     2.000000     7.000000    -5.000000   -33.000000 
        b[2]:   101.000000  -312.000000    15.000000   -33.000000 

          EQ:            0            0            0            1 
          NE:            1            1            1            0 
          LT:            1            0            1            0 
          LE:            1            0            1            1 
          GT:            0            1            0            0 
          GE:            0            1            0            1 
     ORDERED:            1            1            1            1 
   UNORDERED:            0            0            0            0 

        a[3]:   -12.000000   107.000000    16.125000    42.875000 
        b[3]:   198.000000   107.000000    -2.750000          nan 

          EQ:            0            1            0            0 
          NE:            1            0            1            1 
          LT:            1            0            0            0 
          LE:            1            1            0            0 
          GT:            0            0            1            0 
          GE:            0            1            1            0 
     ORDERED:            1            1            1            0 
   UNORDERED:            0            0            0            1 


On systems that support AVX-512, assembly language functions can also use the vcmppd instruction 
with a destination operand opmask register to perform packed double-precision floating-point compares. 
In these instances, the resultant mask is saved in the low-order eight bits of the destination operand opmask 
register. 
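
The opmask results shown above can be modeled in portable C++ without any AVX-512 hardware. The following is a minimal sketch, not the book's code: the names CmpPred, CompareLane, and CompareMaskF32 are hypothetical, and the model assumes the ordered/unordered behavior visible in the results (NE and UNORDERED are true when either operand is a NaN; the other predicates are false). 

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical predicate names that mirror the eight compares in Listing 13-5.
enum class CmpPred { EQ, NE, LT, LE, GT, GE, ORD, UNORD };

// Scalar model of a single compare lane. A compare that involves a NaN is
// unordered: EQ/LT/LE/GT/GE yield false, NE and UNORD yield true.
bool CompareLane(float a, float b, CmpPred pred)
{
    bool unordered = std::isnan(a) || std::isnan(b);

    switch (pred)
    {
        case CmpPred::EQ:    return a == b;
        case CmpPred::NE:    return a != b;
        case CmpPred::LT:    return a < b;
        case CmpPred::LE:    return a <= b;
        case CmpPred::GT:    return a > b;
        case CmpPred::GE:    return a >= b;
        case CmpPred::ORD:   return !unordered;
        case CmpPred::UNORD: return unordered;
    }
    return false;
}

// Builds a 16-bit mask: bit i is set when lane i satisfies the predicate,
// which is the same information vcmpps leaves in an opmask register.
uint16_t CompareMaskF32(const float a[16], const float b[16], CmpPred pred)
{
    assert(a != nullptr && b != nullptr);

    uint16_t mask = 0;
    for (size_t i = 0; i < 16; i++)
    {
        if (CompareLane(a[i], b[i], pred))
            mask |= static_cast<uint16_t>(1u << i);
    }
    return mask;
}
```

Calling CompareMaskF32 with the sixteen test values from Listing 13-5 and CmpPred::EQ sets bits 2, 5, 11, and 13, which matches the EQ rows in the results. 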


Packed Floating-Point Column Means 


Listing 13-6 shows the source code for example Ch13_06. This example, which is an AVX-512 implementation 
of source code example Ch09_03, calculates column means for a two-dimensional array of double-precision 
floating-point values. To make the current source code example a little more interesting, the column means 
are calculated using only the array elements that are greater than or equal to a predetermined threshold value. 


Listing 13-6. Example Ch13_06 


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>
#include <memory>

using namespace std;

// Test size limits to illustrate argument checking
extern "C" size_t c_NumRowsMax = 1000000;
extern "C" size_t c_NumColsMax = 1000000;

extern "C" bool Avx512CalcColumnMeans_(const double* x, size_t nrows, size_t ncols, double* col_means, size_t* col_counts, double x_min);

void Init(double* x, size_t n, int rng_min, int rng_max, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {rng_min, rng_max};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
        x[i] = (double)ui_dist(rng);
}

bool Avx512CalcColumnMeansCpp(const double* x, size_t nrows, size_t ncols, double* col_means, size_t* col_counts, double x_min)
{
    // Make sure nrows and ncols are valid
    if (nrows == 0 || nrows > c_NumRowsMax)
        return false;
    if (ncols == 0 || ncols > c_NumColsMax)
        return false;

    // Initialize column means and column counts to zero
    for (size_t i = 0; i < ncols; i++)
    {
        col_means[i] = 0.0;
        col_counts[i] = 0;
    }

    // Calculate column means
    for (size_t i = 0; i < nrows; i++)
    {
        for (size_t j = 0; j < ncols; j++)
        {
            double val = x[i * ncols + j];

            if (val >= x_min)
            {
                col_means[j] += val;
                col_counts[j]++;
            }
        }
    }

    for (size_t j = 0; j < ncols; j++)
        col_means[j] /= col_counts[j];

    return true;
}


void Avx512CalcColumnMeans(void)
{
    const size_t nrows = 20000;
    const size_t ncols = 23;
    const int rng_min = 1;
    const int rng_max = 999;
    const unsigned int rng_seed = 47;
    const double x_min = 75.0;

    unique_ptr<double[]> x {new double[nrows * ncols]};
    unique_ptr<double[]> col_means1 {new double[ncols]};
    unique_ptr<double[]> col_means2 {new double[ncols]};
    unique_ptr<size_t[]> col_counts1 {new size_t[ncols]};
    unique_ptr<size_t[]> col_counts2 {new size_t[ncols]};

    Init(x.get(), nrows * ncols, rng_min, rng_max, rng_seed);

    bool rc1 = Avx512CalcColumnMeansCpp(x.get(), nrows, ncols, col_means1.get(), col_counts1.get(), x_min);
    bool rc2 = Avx512CalcColumnMeans_(x.get(), nrows, ncols, col_means2.get(), col_counts2.get(), x_min);

    cout << "Results for Avx512CalcColumnMeans\n";

    if (!rc1 || !rc2)
    {
        cout << "Invalid return code: ";
        cout << "rc1 = " << boolalpha << rc1 << ", ";
        cout << "rc2 = " << boolalpha << rc2 << '\n';
        return;
    }

    cout << "Test Matrix (nrows = " << nrows << ", ncols = " << ncols << ")\n";
    cout << "\nColumn Means\n";
    cout << fixed << setprecision(4);

    for (size_t j = 0; j < ncols; j++)
    {
        cout << setw(4) << j << ": ";
        cout << "col_means = ";
        cout << setw(10) << col_means1[j] << ", ";
        cout << setw(10) << col_means2[j] << "   ";
        cout << "col_counts = ";
        cout << setw(6) << col_counts1[j] << ", ";
        cout << setw(6) << col_counts2[j] << '\n';

        if (col_means1[j] != col_means2[j])
            cout << "col_means compare error\n";

        if (col_counts1[j] != col_counts2[j])
            cout << "col_counts compare error\n";
    }
}

int main()
{
    Avx512CalcColumnMeans();
    return 0;
}
;-------------------------------------------------
;               Ch13_06.asm
;-------------------------------------------------

        include <cmpequ.asmh>
        include <MacrosX86-64-AVX.asmh>

        extern c_NumRowsMax:qword
        extern c_NumColsMax:qword

; extern "C" bool Avx512CalcColumnMeans_(const double* x, size_t nrows, size_t ncols,
;   double* col_means, size_t* col_counts, double x_min);

        .code
Avx512CalcColumnMeans_ proc frame
        _CreateFrame CCM_,0,0,rbx,r12,r13
        _EndProlog

; Validate nrows and ncols
        xor eax,eax                         ;set error return code
        test rdx,rdx
        jz Done                             ;jump if nrows is zero
        cmp rdx,[c_NumRowsMax]
        ja Done                             ;jump if nrows is too large
        test r8,r8
        jz Done                             ;jump if ncols is zero
        cmp r8,[c_NumColsMax]
        ja Done                             ;jump if ncols is too large

; Load argument values col_counts and x_min
        mov ebx,1
        vpbroadcastq zmm4,rbx               ;zmm4 = 8 qwords of 1
        mov rbx,[rbp+CCM_OffsetStackArgs]   ;rbx = col_counts ptr
        lea r13,[rbp+CCM_OffsetStackArgs+8] ;r13 = ptr to x_min

; Set initial col_means and col_counts to zero
        xor r10,r10
        vxorpd xmm0,xmm0,xmm0

@@:     vmovsd real8 ptr [r9+rax*8],xmm0    ;col_means[i] = 0.0
        mov [rbx+rax*8],r10                 ;col_counts[i] = 0
        inc rax
        cmp rax,r8
        jne @B                              ;repeat until done

; Compute the sum of each column in x
LP1:    xor r10,r10                         ;r10 = col_index
        mov r11,r9                          ;r11 = ptr to col_means
        mov r12,rbx                         ;r12 = ptr to col_counts

LP2:    mov rax,r10                         ;rax = col_index
        add rax,8
        cmp rax,r8                          ;8 or more columns remaining?
        ja @F                               ;jump if col_index + 8 > ncols

; Update col_means and col_counts using next eight columns
        vmovupd zmm0,zmmword ptr [rcx]      ;load next 8 cols of cur row
        vcmppd k1,zmm0,real8 bcst [r13],CMP_GE  ;k1 = mask of values >= x_min
        vmovupd zmm1{k1}{z},zmm0            ;values >= x_min or 0.0
        vaddpd zmm2,zmm1,zmmword ptr [r11]  ;add values to col_means
        vmovupd zmmword ptr [r11],zmm2      ;save updated col_means

        vpmovm2q zmm0,k1                    ;convert mask to vector
        vpandq zmm1,zmm0,zmm4               ;qword values for add
        vpaddq zmm2,zmm1,zmmword ptr [r12]  ;update col_counts
        vmovdqu64 zmmword ptr [r12],zmm2    ;save updated col_counts

        add r10,8                           ;col_index += 8
        add rcx,64                          ;x += 8
        add r11,64                          ;col_means += 8
        add r12,64                          ;col_counts += 8
        jmp NextColSet

; Update col_means and col_counts using next four columns
@@:     sub rax,4
        cmp rax,r8                          ;4 or more columns remaining?
        ja @F                               ;jump if col_index + 4 > ncols

        vmovupd ymm0,ymmword ptr [rcx]      ;load next 4 cols of cur row
        vcmppd k1,ymm0,real8 bcst [r13],CMP_GE  ;k1 = mask of values >= x_min
        vmovupd ymm1{k1}{z},ymm0            ;values >= x_min or 0.0
        vaddpd ymm2,ymm1,ymmword ptr [r11]  ;add values to col_means
        vmovupd ymmword ptr [r11],ymm2      ;save updated col_means

        vpmovm2q ymm0,k1                    ;convert mask to vector
        vpandq ymm1,ymm0,ymm4               ;qword values for add
        vpaddq ymm2,ymm1,ymmword ptr [r12]  ;update col_counts
        vmovdqu64 ymmword ptr [r12],ymm2    ;save updated col_counts

        add r10,4                           ;col_index += 4
        add rcx,32                          ;x += 4
        add r11,32                          ;col_means += 4
        add r12,32                          ;col_counts += 4
        jmp NextColSet

; Update col_means and col_counts using next two columns
@@:     sub rax,2
        cmp rax,r8                          ;2 or more columns remaining?
        ja @F                               ;jump if col_index + 2 > ncols

        vmovupd xmm0,xmmword ptr [rcx]      ;load next 2 cols of cur row
        vcmppd k1,xmm0,real8 bcst [r13],CMP_GE  ;k1 = mask of values >= x_min
        vmovupd xmm1{k1}{z},xmm0            ;values >= x_min or 0.0
        vaddpd xmm2,xmm1,xmmword ptr [r11]  ;add values to col_means
        vmovupd xmmword ptr [r11],xmm2      ;save updated col_means

        vpmovm2q xmm0,k1                    ;convert mask to vector
        vpandq xmm1,xmm0,xmm4               ;qword values for add
        vpaddq xmm2,xmm1,xmmword ptr [r12]  ;update col_counts
        vmovdqu64 xmmword ptr [r12],xmm2    ;save updated col_counts

        add r10,2                           ;col_index += 2
        add rcx,16                          ;x += 2
        add r11,16                          ;col_means += 2
        add r12,16                          ;col_counts += 2
        jmp NextColSet

; Update col_means using last column of current row
@@:     vmovsd xmm0,real8 ptr [rcx]         ;load x from last column
        vcmpsd k1,xmm0,real8 ptr [r13],CMP_GE   ;k1 = mask of values >= x_min
        vmovsd xmm1{k1}{z},xmm1,xmm0        ;value or 0.0
        vaddsd xmm2,xmm1,real8 ptr [r11]    ;add to col_means
        vmovsd real8 ptr [r11],xmm2         ;save updated col_means
        kmovb eax,k1                        ;eax = 0 or 1
        add qword ptr [r12],rax             ;update col_counts
        add r10,1                           ;col_index += 1
        add rcx,8                           ;update x ptr

NextColSet:
        cmp r10,r8                          ;more columns in current row?
        jb LP2                              ;jump if yes
        dec rdx                             ;nrows -= 1
        jnz LP1                             ;jump if more rows

; Compute the final col_means
@@:     vmovsd xmm0,real8 ptr [r9]          ;xmm0 = col_means[i]
        vcvtsi2sd xmm1,xmm1,qword ptr [rbx] ;xmm1 = col_counts[i]
        vdivsd xmm2,xmm0,xmm1               ;compute final mean
        vmovsd real8 ptr [r9],xmm2          ;save col_means[i]
        add r9,8                            ;update col_means ptr
        add rbx,8                           ;update col_counts ptr
        sub r8,1                            ;ncols -= 1
        jnz @B                              ;repeat until done

        mov eax,1                           ;set success return code

Done:   _DeleteFrame rbx,r12,r13
        vzeroupper
        ret
Avx512CalcColumnMeans_ endp
        end


The function Avx512CalcColumnMeansCpp contains a C++ implementation of the column means 
algorithm. This function uses two nested for loops to sum the elements of each column in the 
two-dimensional array. During each inner loop iteration, the value of array element x[i][j] is added to the 
current column running sum in col_means[j] only if it's greater than or equal to x_min. The number of 
elements greater than or equal to x_min in each column is maintained in the array col_counts. Following 
the summing loops, the final column means are calculated using a simple for loop. 

Following its prolog, the function Avx512CalcColumnMeans_ validates argument values nrows and 
ncols. It then performs its required initializations. The mov ebx,1 and vpbroadcastq zmm4,rbx instructions 
load the value one into each quadword element of ZMM4. Registers RBX and R13 are then initialized as 
pointers to col_counts and x_min, respectively. The final initialization task employs a simple for loop that 
sets each element in col_means and col_counts to zero. 

Similar to source code example Ch09_03, the inner for loop in Avx512CalcColumnMeans_ employs 
slightly different instruction sequences to sum column elements, which vary depending on the number 
of columns in the array (see Figure 9-2) and the current column index. For each row, elements in the first 
eight columns of x can be added to col_means using 512-bit wide packed double-precision floating-point 
addition. The remaining column element values are added to col_means using 512-, 256-, or 128-bit wide 
packed or scalar double-precision floating-point addition. 

The outer loop label LP1 is the starting point for adding elements from the current row of x to 
col_means. The xor r10,r10 instruction initializes col_index to zero; the mov r11,r9 instruction loads R11 
with a pointer to col_means; and mov r12,rbx points R12 to col_counts. Each iteration of the inner loop 
LP2 begins with a check to ensure that at least eight columns are available in the current row for processing. 
If eight columns are available, the vmovupd zmm0,zmmword ptr [rcx] instruction loads the next eight 
elements of the current row into register ZMM0. The ensuing vcmppd k1,zmm0,real8 bcst [r13],CMP_GE 
instruction compares each element in ZMM0 to x_min and sets the corresponding bit position in opmask 
register K1 to indicate the result. Note that the embedded broadcast operand of the vcmppd instruction is 
used here for demonstration purposes. In this source code example, it would be more efficient to initialize 
a packed version of x_min prior to the start of the processing loops. The next instruction, 
vmovupd zmm1{k1}{z},zmm0, uses zero masking to effectively eliminate values less than x_min from subsequent calculations. 
The next two instructions, vaddpd zmm2,zmm1,zmmword ptr [r11] and vmovupd zmmword ptr [r11],zmm2, 
update the running column sums that are maintained in col_means. Figure 13-2 illustrates this operation 
in greater detail. 
[Figure 13-2: register contents before and after the sequence vcmppd k1,zmm0,real8 bcst [r13],CMP_GE; vmovupd zmm1{k1}{z},zmm0; vaddpd zmm2,zmm1,zmmword ptr [r11]; vmovupd zmmword ptr [r11],zmm2, with R13 pointing to x_min = 75.0] 

Figure 13-2. Updating the intermediate sums in col_means using zero merging 
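
The compare, zero-mask, and add sequence can be expressed as a scalar model for readers who want to trace it by hand. This is a rough sketch under the assumption that an 8-bit integer stands in for opmask register K1; the function names MaskGreaterEqual and AddMaskedRow are hypothetical and do not appear in the example's source code. 

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Models vcmppd with the CMP_GE predicate and a broadcast x_min:
// bit i of the returned mask is set when row[i] >= x_min.
uint8_t MaskGreaterEqual(const double row[8], double x_min)
{
    assert(row != nullptr);

    uint8_t k = 0;
    for (size_t i = 0; i < 8; i++)
    {
        if (row[i] >= x_min)
            k |= static_cast<uint8_t>(1u << i);
    }
    return k;
}

// Models the zero-masked move followed by the packed add and store.
// Lanes whose mask bit is clear contribute 0.0 to the running sums.
void AddMaskedRow(double col_sums[8], const double row[8], uint8_t k)
{
    assert(col_sums != nullptr && row != nullptr);

    for (size_t i = 0; i < 8; i++)
    {
        double lane = (k & (1u << i)) ? row[i] : 0.0;   // zero masking
        col_sums[i] += lane;
    }
}
```

Because a masked-off lane adds 0.0, the running sum for that column is unchanged, which is why no branch is needed inside the loop. 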


The next code block updates the element counts in col_counts. The vpmovm2q zmm0,k1 
instruction (Convert Mask Register to Vector Register) sets each quadword element in ZMM0 to all ones 
(0xFFFFFFFFFFFFFFFF) or all zeros (0x0000000000000000) according to the value of the corresponding 
bit position in K1. The ensuing vpandq zmm1,zmm0,zmm4 instruction zeros the high-order 63 bits of 
each quadword value in ZMM0 and saves this result in ZMM1. The next two instructions, vpaddq 
zmm2,zmm1,zmmword ptr [r12] and vmovdqu64 zmmword ptr [r12],zmm2, update the count values in 
col_counts, as shown in Figure 13-3. The vmovdqu64 instruction saves the 512-bit wide packed quadword 
operand in ZMM2 to the location pointed to by register R12. AVX512F also includes a vmovdqu32 instruction 
for 512-bit wide packed doubleword moves. 


[Figure 13-3: register and memory contents before and after the sequence vpmovm2q zmm0,k1; vpandq zmm1,zmm0,zmm4; vpaddq zmm2,zmm1,zmmword ptr [r12]; vmovdqu64 zmmword ptr [r12],zmm2] 

Figure 13-3. Updating the intermediate element counts in col_counts 
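
The mask-to-count conversion can likewise be traced with a small scalar model. The sketch below assumes an 8-bit integer in place of K1 and uses hypothetical helper names (MaskToQwords, AddMaskToCounts); it is meant only to illustrate the expand, AND, and add steps, not to reproduce the assembly language code. 

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Models vpmovm2q: each mask bit expands to an all-ones or all-zeros qword.
void MaskToQwords(uint64_t v[8], uint8_t k)
{
    for (size_t i = 0; i < 8; i++)
        v[i] = (k & (1u << i)) ? 0xFFFFFFFFFFFFFFFFull : 0ull;
}

// Models vpandq with a vector of ones followed by vpaddq into the counts.
void AddMaskToCounts(uint64_t col_counts[8], uint8_t k)
{
    assert(col_counts != nullptr);

    uint64_t v[8];
    MaskToQwords(v, k);

    for (size_t i = 0; i < 8; i++)
        col_counts[i] += v[i] & 1ull;       // keep only the low-order bit
}
```

After the AND step each lane holds either 0 or 1, so a single packed add increments exactly those counters whose elements passed the threshold test. 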


Following execution of the vmovdqu64 instruction, the algorithm's various pointers and counters are 
updated to reflect the eight processed elements. The summation code repeats until the number of array 
elements that remain in the current row is less than eight. Once this condition is met, the remaining 
column elements (if any) are processed using 256-, 128-, or 64-bit wide operands using the same 
technique described in the previous paragraph. Note that function Avx512CalcColumnMeans_ uses AVX-512 
instructions that employ YMM or XMM registers with embedded broadcast and zero merging operands. 
These instructions require an AVX-512 conforming processor that supports the AVX512VL instruction set 
extension. After calculating the column sums, each element in col_means is divided by the corresponding 
element in col_counts to obtain the final column mean. Here are the results for source code example 
Ch13_06: 


Results for Avx512CalcColumnMeans 
Test Matrix (nrows = 20000, ncols = 23) 

Column Means 
   0: col_means =   536.6483,   536.6483   col_counts =  18548,  18548 
   1: col_means =   535.8669,   535.8669   col_counts =  18538,  18538 
   2: col_means =   534.7049,   534.7049   col_counts =  18457,  18457 
   3: col_means =   535.8747,   535.8747   col_counts =  18544,  18544 
   4: col_means =   540.7477,   540.7477   col_counts =  18501,  18501 
   5: col_means =   535.9465,   535.9465   col_counts =  18493,  18493 
   6: col_means =   539.0142,   539.0142   col_counts =  18528,  18528 
   7: col_means =   536.6623,   536.6623   col_counts =  18496,  18496 
   8: col_means =   532.1445,   532.1445   col_counts =  18486,  18486 
   9: col_means =   543.4736,   543.4736   col_counts =  18479,  18479 
  10: col_means =   535.2980,   535.2980   col_counts =  18552,  18552 
  11: col_means =   536.4255,   536.4255   col_counts =  18486,  18486 
  12: col_means =   537.6472,   537.6472   col_counts =  18473,  18473 
  13: col_means =   537.9775,   537.9715   col_counts =  18511,  18511 
  14: col_means =   538.4742,   538.4742   col_counts =  18514,  18514 
  15: col_means =   539.2965,   539.2965   col_counts =  18497,  18497 
  16: col_means =   537.9710,   537.9710   col_counts =  18454,  18454 
  17: col_means =   536.7826,   536.7826   col_counts =  18566,  18566 
  18: col_means =   538.3274,   538.3274   col_counts =  18452,  18452 
  19: col_means =   538.2181,   538.2181   col_counts =  18491,  18491 
  20: col_means =   532.6881,   532.6881   col_counts =  18514,  18514 
  21: col_means =   537.0067,   537.0067   col_counts =  18554,  18554 
  22: col_means =   539.0643,   539.0643   col_counts =  18548,  18548 
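
The way the column-means processing loop picks an operand width for each step of a row can be summarized with a short helper. This is an illustrative sketch only; the function name ColumnChunkWidths is hypothetical, and it simply restates the 8/4/2/1 column selection logic as described for Avx512CalcColumnMeans_. 

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the sequence of chunk widths (8, 4, 2, or 1 columns) that would be
// selected while walking one row that contains ncols columns.
std::vector<size_t> ColumnChunkWidths(size_t ncols)
{
    std::vector<size_t> widths;
    size_t col_index = 0;

    while (col_index < ncols)
    {
        size_t remaining = ncols - col_index;
        size_t w = (remaining >= 8) ? 8 : (remaining >= 4) ? 4 : (remaining >= 2) ? 2 : 1;

        assert(w <= remaining);             // never read past the end of the row
        widths.push_back(w);
        col_index += w;
    }
    return widths;
}
```

For the 23-column test matrix, each row is handled as two 512-bit steps, one 256-bit step, one 128-bit step, and one scalar step. 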


Vector Cross Products 


The next source code example, Ch13_07, demonstrates vector cross product calculations using arrays of 
three-dimensional vectors. It also illustrates how to perform data gather and scatter operations using 
AVX-512 instructions. Listing 13-7 shows the source code for example Ch13_07. 


Listing 13-7. Example Ch13_07 


#pragma once

// Simple vector structure
typedef struct
{
    double X;       // Vector X component
    double Y;       // Vector Y component
    double Z;       // Vector Z component
} Vector;

// Vector structure of arrays
typedef struct
{
    double* X;      // Pointer to X components
    double* Y;      // Pointer to Y components
    double* Z;      // Pointer to Z components
} VectorSoA;

// Ch13_07.cpp
void InitVec(Vector* a_aos, Vector* b_aos, VectorSoA& a_soa, VectorSoA& b_soa, size_t num_vec);
bool Avx512VcpAosCpp(Vector* c, const Vector* a, const Vector* b, size_t num_vec);
bool Avx512VcpSoaCpp(VectorSoA* c, const VectorSoA* a, const VectorSoA* b, size_t num_vec);

// Ch13_07_.asm
extern "C" bool Avx512VcpAos_(Vector* c, const Vector* a, const Vector* b, size_t num_vec);
extern "C" bool Avx512VcpSoa_(VectorSoA* c, const VectorSoA* a, const VectorSoA* b, size_t num_vec);

// Ch13_07_BM.cpp
extern void Avx512Vcp_BM(void);

#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>
#include <memory>
#include "Ch13_07.h"
#include "AlignedMem.h"

using namespace std;

void InitVec(Vector* a_aos, Vector* b_aos, VectorSoA& a_soa, VectorSoA& b_soa, size_t num_vec)
{
    uniform_int_distribution<> ui_dist {1, 100};
    default_random_engine rng {103};

    for (size_t i = 0; i < num_vec; i++)
    {
        double a_x = (double)ui_dist(rng);
        double a_y = (double)ui_dist(rng);
        double a_z = (double)ui_dist(rng);
        double b_x = (double)ui_dist(rng);
        double b_y = (double)ui_dist(rng);
        double b_z = (double)ui_dist(rng);

        a_aos[i].X = a_soa.X[i] = a_x;
        a_aos[i].Y = a_soa.Y[i] = a_y;
        a_aos[i].Z = a_soa.Z[i] = a_z;
        b_aos[i].X = b_soa.X[i] = b_x;
        b_aos[i].Y = b_soa.Y[i] = b_y;
        b_aos[i].Z = b_soa.Z[i] = b_z;
    }
}
void Avx512Vcp(void)
{
    const size_t align = 64;
    const size_t num_vec = 16;

    unique_ptr<Vector> a_aos_up {new Vector[num_vec] };
    unique_ptr<Vector> b_aos_up {new Vector[num_vec] };
    unique_ptr<Vector> c_aos_up {new Vector[num_vec] };
    Vector* a_aos = a_aos_up.get();
    Vector* b_aos = b_aos_up.get();
    Vector* c_aos = c_aos_up.get();

    VectorSoA a_soa, b_soa, c_soa;
    AlignedArray<double> a_soa_x_aa(num_vec, align);
    AlignedArray<double> a_soa_y_aa(num_vec, align);
    AlignedArray<double> a_soa_z_aa(num_vec, align);
    AlignedArray<double> b_soa_x_aa(num_vec, align);
    AlignedArray<double> b_soa_y_aa(num_vec, align);
    AlignedArray<double> b_soa_z_aa(num_vec, align);
    AlignedArray<double> c_soa_x_aa(num_vec, align);
    AlignedArray<double> c_soa_y_aa(num_vec, align);
    AlignedArray<double> c_soa_z_aa(num_vec, align);

    a_soa.X = a_soa_x_aa.Data();
    a_soa.Y = a_soa_y_aa.Data();
    a_soa.Z = a_soa_z_aa.Data();
    b_soa.X = b_soa_x_aa.Data();
    b_soa.Y = b_soa_y_aa.Data();
    b_soa.Z = b_soa_z_aa.Data();
    c_soa.X = c_soa_x_aa.Data();
    c_soa.Y = c_soa_y_aa.Data();
    c_soa.Z = c_soa_z_aa.Data();

    InitVec(a_aos, b_aos, a_soa, b_soa, num_vec);

    bool rc1 = Avx512VcpAos_(c_aos, a_aos, b_aos, num_vec);
    bool rc2 = Avx512VcpSoa_(&c_soa, &a_soa, &b_soa, num_vec);

    cout << "Results for Avx512VectorCrossProd\n";

    if (!rc1 || !rc2)
    {
        cout << "Invalid return code - ";
        cout << "rc1 = " << boolalpha << rc1 << ", ";
        cout << "rc2 = " << boolalpha << rc2 << ", ";
        return;
    }


    cout << fixed << setprecision(1);

    for (size_t i = 0; i < num_vec; i++)
    {
        cout << "Vector cross product #" << i << '\n';

        const unsigned int w = 9;

        cout << " a: ";
        cout << setw(w) << a_aos[i].X << ' ';
        cout << setw(w) << a_aos[i].Y << ' ';
        cout << setw(w) << a_aos[i].Z << '\n';

        cout << " b: ";
        cout << setw(w) << b_aos[i].X << ' ';
        cout << setw(w) << b_aos[i].Y << ' ';
        cout << setw(w) << b_aos[i].Z << '\n';

        cout << " c_aos: ";
        cout << setw(w) << c_aos[i].X << ' ';
        cout << setw(w) << c_aos[i].Y << ' ';
        cout << setw(w) << c_aos[i].Z << '\n';

        cout << " c_soa: ";
        cout << setw(w) << c_soa.X[i] << ' ';
        cout << setw(w) << c_soa.Y[i] << ' ';
        cout << setw(w) << c_soa.Z[i] << '\n';

        bool is_valid_x = c_aos[i].X == c_soa.X[i];
        bool is_valid_y = c_aos[i].Y == c_soa.Y[i];
        bool is_valid_z = c_aos[i].Z == c_soa.Z[i];

        if (!is_valid_x || !is_valid_y || !is_valid_z)
        {
            cout << "Compare error at index " << i << '\n';
            cout << " is_valid_x = " << boolalpha << is_valid_x << '\n';
            cout << " is_valid_y = " << boolalpha << is_valid_y << '\n';
            cout << " is_valid_z = " << boolalpha << is_valid_z << '\n';
            return;
        }
    }
}

int main()
{
    Avx512Vcp();
    Avx512Vcp_BM();
    return 0;
}
;-------------------------------------------------
;               Ch13_07.asm
;-------------------------------------------------

        include <MacrosX86-64-AVX.asmh>

; Indices for gather and scatter instructions
ConstVals segment readonly align(64) 'const'
GS_X    qword 0, 3, 6, 9, 12, 15, 18, 21
GS_Y    qword 1, 4, 7, 10, 13, 16, 19, 22
GS_Z    qword 2, 5, 8, 11, 14, 17, 20, 23
ConstVals ends

; extern "C" bool Avx512VcpAos_(Vector* c, const Vector* a, const Vector* b, size_t num_vectors);

        .code
Avx512VcpAos_ proc

; Make sure num_vec is valid
        xor eax,eax                         ;set error code (also i = 0)
        test r9,r9
        jz Done                             ;jump if num_vec is zero
        test r9,07h
        jnz Done                            ;jump if num_vec % 8 != 0 is true

; Load indices for gather and scatter operations
        vmovdqa64 zmm29,zmmword ptr [GS_X]  ;zmm29 = X component indices
        vmovdqa64 zmm30,zmmword ptr [GS_Y]  ;zmm30 = Y component indices
        vmovdqa64 zmm31,zmmword ptr [GS_Z]  ;zmm31 = Z component indices

; Load next 8 vectors
        align 16
@@:     kxnorb k1,k1,k1
        vgatherqpd zmm0{k1},[rdx+zmm29*8]   ;zmm0 = A.X values
        kxnorb k2,k2,k2
        vgatherqpd zmm1{k2},[rdx+zmm30*8]   ;zmm1 = A.Y values
        kxnorb k3,k3,k3
        vgatherqpd zmm2{k3},[rdx+zmm31*8]   ;zmm2 = A.Z values

        kxnorb k4,k4,k4
        vgatherqpd zmm3{k4},[r8+zmm29*8]    ;zmm3 = B.X values
        kxnorb k5,k5,k5
        vgatherqpd zmm4{k5},[r8+zmm30*8]    ;zmm4 = B.Y values
        kxnorb k6,k6,k6
        vgatherqpd zmm5{k6},[r8+zmm31*8]    ;zmm5 = B.Z values

; Calculate 8 vector cross products
        vmulpd zmm16,zmm1,zmm5
        vmulpd zmm17,zmm2,zmm4
        vsubpd zmm18,zmm16,zmm17            ;c.X = a.Y * b.Z - a.Z * b.Y

        vmulpd zmm19,zmm2,zmm3
        vmulpd zmm20,zmm0,zmm5
        vsubpd zmm21,zmm19,zmm20            ;c.Y = a.Z * b.X - a.X * b.Z

        vmulpd zmm22,zmm0,zmm4
        vmulpd zmm23,zmm1,zmm3
        vsubpd zmm24,zmm22,zmm23            ;c.Z = a.X * b.Y - a.Y * b.X

; Save calculated cross products
        kxnorb k4,k4,k4
        vscatterqpd [rcx+zmm29*8]{k4},zmm18 ;save C.X components
        kxnorb k5,k5,k5
        vscatterqpd [rcx+zmm30*8]{k5},zmm21 ;save C.Y components
        kxnorb k6,k6,k6
        vscatterqpd [rcx+zmm31*8]{k6},zmm24 ;save C.Z components

; Update pointers and counters
        add rcx,192                         ;c += 8
        add rdx,192                         ;a += 8
        add r8,192                          ;b += 8
        add rax,8                           ;i += 8
        cmp rax,r9
        jb @B

        mov eax,1                           ;set success return code

Done:   vzeroupper
        ret
Avx512VcpAos_ endp

; extern "C" bool Avx512VcpSoa_(VectorSoA* c, const VectorSoA* a, const VectorSoA* b, size_t num_vectors);

Avx512VcpSoa_ proc frame
        _CreateFrame CP2_,0,0,rbx,rsi,rdi,r12,r13,r14,r15
        _EndProlog

; Make sure num_vec is valid
        xor eax,eax
        test r9,r9
        jz Done                             ;jump if num_vec is zero
        test r9,07h
        jnz Done                            ;jump if num_vec % 8 != 0 is true

; Load vector array pointers and check for proper alignment
        mov r10,[rdx]                       ;r10 = a.X
        or rax,r10
        mov r11,[rdx+8]                     ;r11 = a.Y
        or rax,r11
        mov r12,[rdx+16]                    ;r12 = a.Z
        or rax,r12

        mov r13,[r8]                        ;r13 = b.X
        or rax,r13
        mov r14,[r8+8]                      ;r14 = b.Y
        or rax,r14
        mov r15,[r8+16]                     ;r15 = b.Z
        or rax,r15

        mov rbx,[rcx]                       ;rbx = c.X
        or rax,rbx
        mov rsi,[rcx+8]                     ;rsi = c.Y
        or rax,rsi
        mov rdi,[rcx+16]                    ;rdi = c.Z
        or rax,rdi

        and rax,3fh                         ;misaligned component array?
        mov eax,0                           ;error return code (also i = 0)
        jnz Done

; Load next block (8 vectors) from a and b
; (the memory operands of these six loads were lost in extraction and are
;  reconstructed here from the register assignments above)
        align 16
@@:     vmovapd zmm0,zmmword ptr [r10+rax*8]    ;zmm0 = a.X values
        vmovapd zmm1,zmmword ptr [r11+rax*8]    ;zmm1 = a.Y values
        vmovapd zmm2,zmmword ptr [r12+rax*8]    ;zmm2 = a.Z values
        vmovapd zmm3,zmmword ptr [r13+rax*8]    ;zmm3 = b.X values
        vmovapd zmm4,zmmword ptr [r14+rax*8]    ;zmm4 = b.Y values
        vmovapd zmm5,zmmword ptr [r15+rax*8]    ;zmm5 = b.Z values

; Calculate cross products
        vmulpd zmm16,zmm1,zmm5
        vmulpd zmm17,zmm2,zmm4
        vsubpd zmm18,zmm16,zmm17            ;c.X = a.Y * b.Z - a.Z * b.Y

        vmulpd zmm19,zmm2,zmm3
        vmulpd zmm20,zmm0,zmm5
        vsubpd zmm21,zmm19,zmm20            ;c.Y = a.Z * b.X - a.X * b.Z

        vmulpd zmm22,zmm0,zmm4
        vmulpd zmm23,zmm1,zmm3
        vsubpd zmm24,zmm22,zmm23            ;c.Z = a.X * b.Y - a.Y * b.X

; Save calculated cross products
        vmovapd zmmword ptr [rbx+rax*8],zmm18   ;save C.X values
        vmovapd zmmword ptr [rsi+rax*8],zmm21   ;save C.Y values
        vmovapd zmmword ptr [rdi+rax*8],zmm24   ;save C.Z values

        add rax,8                           ;i += 8
        cmp rax,r9
        jb @B                               ;repeat until done

        mov eax,1                           ;set success return code

Done:   vzeroupper
        _DeleteFrame rbx,rsi,rdi,r12,r13,r14,r15
        ret
Avx512VcpSoa_ endp
        end


The cross product of two three-dimensional vectors a and b is a third vector c that is perpendicular to 
both a and b. The x, y, and z components of c can be calculated using the following equations: 


\[ c_x = a_y b_z - a_z b_y \qquad c_y = a_z b_x - a_x b_z \qquad c_z = a_x b_y - a_y b_x \] 
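
The cross product equations translate directly into a few lines of scalar C++. The sketch below is illustrative only: Vec3, CrossProduct, and DotProduct are hypothetical names, with Vec3 standing in for the Vector structure used by the example. The dot product is included because it gives a quick check of the perpendicularity property. 

```cpp
#include <cassert>

// Hypothetical stand-in for the Vector structure in Ch13_07.h.
struct Vec3
{
    double X;
    double Y;
    double Z;
};

// Computes c = a x b using the three component equations.
Vec3 CrossProduct(const Vec3& a, const Vec3& b)
{
    Vec3 c;
    c.X = a.Y * b.Z - a.Z * b.Y;
    c.Y = a.Z * b.X - a.X * b.Z;
    c.Z = a.X * b.Y - a.Y * b.X;
    return c;
}

// A cross product is perpendicular to both inputs, so its dot product
// with either input is zero (exactly so for small integer-valued inputs).
double DotProduct(const Vec3& a, const Vec3& b)
{
    return a.X * b.X + a.Y * b.Y + a.Z * b.Z;
}
```

For example, the cross product of the unit vectors (1, 0, 0) and (0, 1, 0) is (0, 0, 1). 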


The C++ header file Ch13_07.h that’s shown in Listing 13-7 includes the structure definitions Vector 
and VectorSoA. The structure Vector contains three double-precision floating-point values—X, Y, and 
Z—that represent the components of a three-dimensional vector. The VectorSoA structure incorporates 
three pointers to double-precision floating-point arrays. Each array contains the values for a single vector 
component. Example Ch13_07 uses these structures to compare the performance of two different vector 
cross product calculating algorithms. The first algorithm performs its calculations using an array of 
structures (AOS), while the second algorithm exploits a structure of arrays (SOA). 
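
To make the SOA arrangement concrete, here is a scalar reference loop over a structure of arrays. It is a sketch under stated assumptions rather than the example's actual C++ code: the function name CrossProdSoa is hypothetical, and the num_vec checks simply mirror the ones performed by the assembly language functions. 

```cpp
#include <cassert>
#include <cstddef>

// Same shape as VectorSoA in Ch13_07.h: one array per vector component.
struct VectorSoA
{
    double* X;
    double* Y;
    double* Z;
};

// Hypothetical scalar reference loop. Every component array is read and
// written with unit stride, which is what allows a SIMD version to use
// ordinary packed loads and stores.
bool CrossProdSoa(VectorSoA* c, const VectorSoA* a, const VectorSoA* b, size_t num_vec)
{
    assert(c != nullptr && a != nullptr && b != nullptr);

    if (num_vec == 0 || (num_vec % 8) != 0)
        return false;

    for (size_t i = 0; i < num_vec; i++)
    {
        c->X[i] = a->Y[i] * b->Z[i] - a->Z[i] * b->Y[i];
        c->Y[i] = a->Z[i] * b->X[i] - a->X[i] * b->Z[i];
        c->Z[i] = a->X[i] * b->Y[i] - a->Y[i] * b->X[i];
    }
    return true;
}
```

In an AOS layout the same loop would touch the X, Y, and Z members of consecutive structures, so successive X values sit 24 bytes apart in memory. 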

The C++ function Avx512Vcp begins its execution by allocating storage space for sets of vector data 
structures. This function uses the C++ template class unique_ptr<Vector> to allocate storage for three 
AOSs. Note that each Vector object is not explicitly aligned on a 64-byte boundary since doing this would 
consume a considerable amount of storage space that’s never used. Each unique_ptr<Vector> AOS is also 
representative of how this type of data construct is commonly employed in many real-world programs. 
Avx512Vcp uses the C++ template class AlignedArray<double> to allocate properly aligned storage space 
for the vector SOAs. Following data structure allocation, the function InitVec initializes both sets of 
vectors a and b using random values. It then invokes the assembly language vector cross product functions 
Avx512VcpAos_ and Avx512VcpSoa_. 

Near the top of the assembly language file is a custom constant segment named ConstVals, which 
contains indices for the vgatherqpd and vscatterqpd instructions that are used in Avx512VcpAos_. The
index values in this segment correspond to the memory ordering of Vector components X, Y, and Z in an 
array of Vector objects. Figure 13-4 illustrates this ordering in greater detail. Note that the indices defined in 
ConstVals enable the vgatherqpd and vscatterqpd instructions to load and save eight Vector objects. 
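
The following C++ sketch shows one way such indices can be derived, assuming that a Vector object occupies exactly three consecutive double-precision elements and that each index is expressed in units of 8-byte elements (which matches the *8 scale factor used by the gather instructions). The helper names MakeGatherIndices and Gather8 are hypothetical; the second function is only a scalar model of an unmasked eight-element gather.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of double-precision elements per Vector object (X, Y, Z).
constexpr std::int64_t kElemsPerVector = 3;

// Element indices of one component (0 = X, 1 = Y, 2 = Z) for eight
// consecutive Vector objects, in units of 8-byte elements.
std::array<std::int64_t, 8> MakeGatherIndices(int comp)
{
    assert(comp >= 0 && comp < 3);

    std::array<std::int64_t, 8> idx{};
    for (std::size_t i = 0; i < 8; i++)
        idx[i] = static_cast<std::int64_t>(i) * kElemsPerVector + comp;
    return idx;
}

// Scalar model of an unmasked eight-element gather: dst[i] = base[idx[i]].
std::array<double, 8> Gather8(const double* base, const std::array<std::int64_t, 8>& idx)
{
    std::array<double, 8> dst{};
    for (std::size_t i = 0; i < 8; i++)
        dst[i] = base[idx[i]];
    return dst;
}
```

Under these assumptions, the X indices are 0, 3, 6, ..., 21, the Y indices are 1, 4, 7, ..., 22, and the Z indices are 2, 5, 8, ..., 23.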


[Figure 13-4 diagram: gather/scatter index values for the X, Y, and Z components of elements a[0] through a[4] of const Vector* a]


Figure 13-4. Memory ordering of components X, Y, and Z in an array of Vector objects 


Following validation of num_vec, three vmovdqa64 (Move Aligned Packed Quadword Values) instructions
load the gather/scatter indices for Vector components X, Y, and Z into registers ZMM29, ZMM30, and
ZMM31, respectively. The processing loop begins with a kxnorb k1,k1,k1 instruction that sets the
low-order eight bits of opmask register K1 to one. The subsequent vgatherqpd zmm0{k1},[rdx+zmm29*8]
instruction loads eight X component values from Vector a into register ZMM0. The vgatherqpd instruction
loads eight values since the low-order eight bits of opmask register K1 are all set to one.
Five more sets of kxnorb and vgatherqpd instructions load the remaining Vector components into
registers ZMM1-ZMM5. Note that during its execution, the vgatherqpd instruction sets the entire opmask 
register to zero unless an exception occurs due to an invalid memory access, which can be caused by an 
incorrect index or bad base register value. This updating of the opmask register introduces a potential 
register dependency that is eliminated by using a different opmask register for each vgatherqpd instruction. 
The next code block calculates eight vector cross products using basic packed double-precision floating- 
point arithmetic. The cross-product results are then saved to the destination Vector array c using three 
vscatterqpd instructions. Like the vgatherqpd instruction, the vscatterqpd instruction also sets its opmask 
register operand to zero unless an exception occurs. 
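
The opmask behavior described above can be modeled in a few lines of scalar C++. This is only a sketch of the instruction's visible effect for eight double-precision elements; the function name MaskedGather8 is hypothetical, and the model ignores exceptions.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of a masked eight-element gather such as
// vgatherqpd zmm0{k1},[base+zmm29*8]. Element i is loaded only if mask
// bit i is set; the bit is cleared once that element has been loaded, so
// the mask is zero after a gather that completes normally.
void MaskedGather8(double dst[8], std::uint8_t& mask, const double* base, const std::int64_t idx[8])
{
    assert(dst != nullptr && base != nullptr && idx != nullptr);

    for (int i = 0; i < 8; i++)
    {
        const std::uint8_t bit = static_cast<std::uint8_t>(1u << i);

        if (mask & bit)
        {
            dst[i] = base[idx[i]];                          // load element i
            mask = static_cast<std::uint8_t>(mask & ~bit);  // mark it as done
        }
        // Elements whose mask bit is zero are left unchanged in dst.
    }
}
```

Because the mask is consumed by the operation, it must be set again (for example, with kxnorb) before each gather or scatter, which is why the processing loop reloads its opmask registers on every pass.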

The function Avx512VcpSoa_ begins its execution by validating num_vec. It then verifies that the nine
vector component array pointers are properly aligned on a 64-byte boundary. The processing loop in
Avx512VcpSoa_ employs straightforward packed double-precision floating-point arithmetic to calculate the
vector cross products. Note that Avx512VcpSoa_ uses the aligned move instruction vmovapd to perform all
vector component loads and stores. Here are the results for source code example Ch13_07:


Results for Avx512VectorCrossProd 


Vector cross product #0
a: 96.0 30.0 52.0
b: 64.0 62.0 79.0
c_aos: -854.0 -4256.0 4032.0
c_soa: -854.0 -4256.0 4032.0
Vector cross product #1
a: 26.0 33.0 66.0
b: 89.0 36.0 20.0
c_aos: -1716.0 5354.0 -2001.0
c_soa: -1716.0 5354.0 -2001.0
Vector cross product #2
a: 56.0 60.0 53.0
b: 16.0 45.0 46.0
c_aos: 375.0 -1728.0 1560.0
c_soa: 375.0 -1728.0 1560.0
Vector cross product #3
a: 79.0 27.0 22.0
b: 18.0 75.0 45.0
c_aos: -435.0 -3159.0 5439.0
c_soa: -435.0 -3159.0 5439.0
Vector cross product #4
a: 77.0 30.0 46.0
b: 44.0 77.0 99.0
c_aos: -572.0 -5599.0 4609.0
c_soa: -572.0 -5599.0 4609.0
Vector cross product #5
a: 30.0 21.0 26.0
b: 43.0 61.0 47.0
c_aos: -599.0 -292.0 927.0
c_soa: -599.0 -292.0 927.0
Vector cross product #6
a: 58.0 56.0 46.0
b: 84.0 37.0 76.0
c_aos: 2554.0 -544.0 -2558.0




c_soa: 2554.0
Vector cross product
a: 34.0
b: 20.0
c_aos: -3837.0
c_soa: -3837.0
Vector cross product
a: 34.0
b: 48.0
c_aos: 1165.0
c_soa: 1165.0
Vector cross product
a: 28.0
b: 6.0
c_aos: -1514.0
c_soa: -1514.0
Vector cross product
a: 43.0
b: 12.0
c_aos: 2320.0
c_soa: 2320.0
Vector cross product
a: 53.0
b: 78.0
c_aos: 2180.0
c_soa: 2180.0
Vector cross product
a: 9.0
b: 54.0
c_aos: 3540.0
c_soa: 3540.0
Vector cross product
a: 15.0
b: 94.0
c_aos: -1047.0
c_soa: -1047.0
Vector cross product
a: 95.0
b: 45.0
c_aos: -940.0
c_soa: -940.0
Vector cross product
a: 76.0
b: 29.0
c_aos: 346.0
c_soa: 346.0

Running benchmark function Avx512VectorCrossProd_BM - please wait
Benchmark times save to file Ch13_07_Avx512VectorCrossProd_BM_CHROMIUM.csv

Table 13-1 shows benchmark timing measurements for the two cross product calculating functions.
This table uses dashes to signify processors that do not support AVX-512. For source code example Ch13_07,
the SOA technique is somewhat faster than the AOS method: on the i9-7900X, Avx512VcpSoa_ takes roughly
12.5% less time than Avx512VcpAos_ (4141 versus 4734).


Table 13-1. Benchmark Timing Measurements for Vector Cross Product Calculating Functions (1,000,000 Cross Products)

CPU         Avx512VcpAos_   Avx512VcpSoa_
i7-4790S    ----            ----
i9-7900X    4734            4141
i7-8700K    ----            ----


Matrix-Vector Multiplication 


Many computer graphics and image processing algorithms perform matrix-vector multiplications using 4 x 4 
matrices and 4 x 1 vectors. In 3D computer graphics software, these types of calculations are universally 
employed to perform affine transformations (e.g., translation, rotation, and scaling) using homogeneous 
coordinates. Figure 13-5 shows the equations that can be used to multiply a 4 x 4 matrix by a 4 x 1 vector. 
Note that each component of vector b is a simple sum-of-products calculation that uses the matrix’s columns
and the individual components of vector a. Figure 13-5 also shows a sample matrix-vector multiplication
calculation using real numbers.


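$$
\begin{bmatrix} b_w \\ b_x \\ b_y \\ b_z \end{bmatrix} =
\begin{bmatrix}
m_{00} & m_{01} & m_{02} & m_{03} \\
m_{10} & m_{11} & m_{12} & m_{13} \\
m_{20} & m_{21} & m_{22} & m_{23} \\
m_{30} & m_{31} & m_{32} & m_{33}
\end{bmatrix}
\begin{bmatrix} a_w \\ a_x \\ a_y \\ a_z \end{bmatrix}
\qquad
\begin{bmatrix} 304 \\ 564 \\ 824 \\ 1084 \end{bmatrix} =
\begin{bmatrix}
10 & 11 & 12 & 13 \\
20 & 21 & 22 & 23 \\
30 & 31 & 32 & 33 \\
40 & 41 & 42 & 43
\end{bmatrix}
\begin{bmatrix} 5 \\ 6 \\ 7 \\ 8 \end{bmatrix}
$$

$$
\begin{aligned}
b_w &= m_{00}a_w + m_{01}a_x + m_{02}a_y + m_{03}a_z & \qquad 304 &= 10(5) + 11(6) + 12(7) + 13(8) \\
b_x &= m_{10}a_w + m_{11}a_x + m_{12}a_y + m_{13}a_z & \qquad 564 &= 20(5) + 21(6) + 22(7) + 23(8) \\
b_y &= m_{20}a_w + m_{21}a_x + m_{22}a_y + m_{23}a_z & \qquad 824 &= 30(5) + 31(6) + 32(7) + 33(8) \\
b_z &= m_{30}a_w + m_{31}a_x + m_{32}a_y + m_{33}a_z & \qquad 1084 &= 40(5) + 41(6) + 42(7) + 43(8)
\end{aligned}
$$

Matrix columns, left to right: col 0, col 1, col 2, col 3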


Figure 13-5. Equations for matrix-vector multiplication and a sample calculation 


Listing 13-8 shows the source code for example Ch13_08. This example demonstrates how to multiply a
single 4 x 4 matrix with a set of 4 x 1 vectors that are stored in an array.

Listing 13-8. Example Ch13_08

#pragma once

// Simple 4x1 vector structure
struct Vec4x1_F32
{
    float W, X, Y, Z;
};

// Ch13_08.cpp
extern void InitVecArray(Vec4x1_F32* va, size_t num_vec);
extern bool Avx512MatVecMulF32Cpp(Vec4x1_F32* vec_b, float mat[4][4], Vec4x1_F32* vec_a, size_t num_vec);

// Ch13_08_.asm
extern "C" bool Avx512MatVecMulF32_(Vec4x1_F32* vec_b, float mat[4][4], Vec4x1_F32* vec_a, size_t num_vec);

// Ch13_08_BM.cpp
extern void Avx512MatVecMulF32_BM(void);


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <random>
#include <cmath>
#include "Ch13_08.h"
#include "AlignedMem.h"

using namespace std;

bool VecCompare(const Vec4x1_F32* v1, const Vec4x1_F32* v2)
{
    static const float eps = 1.0e-12f;

    bool b0 = (fabs(v1->W - v2->W) <= eps);
    bool b1 = (fabs(v1->X - v2->X) <= eps);
    bool b2 = (fabs(v1->Y - v2->Y) <= eps);
    bool b3 = (fabs(v1->Z - v2->Z) <= eps);

    return b0 && b1 && b2 && b3;
}

void InitVecArray(Vec4x1_F32* va, size_t num_vec)
{
    uniform_int_distribution<> ui_dist {1, 500};
    default_random_engine rng {187};

    for (size_t i = 0; i < num_vec; i++)
    {
        va[i].W = (float)ui_dist(rng);
        va[i].X = (float)ui_dist(rng);
        va[i].Y = (float)ui_dist(rng);
        va[i].Z = (float)ui_dist(rng);
    }

    if (num_vec >= 4)
    {
        // Test values
        va[0].W = 5; va[0].X = 6; va[0].Y = 7; va[0].Z = 8;
        va[1].W = 15; va[1].X = 16; va[1].Y = 17; va[1].Z = 18;
        va[2].W = 25; va[2].X = 26; va[2].Y = 27; va[2].Z = 28;
        va[3].W = 35; va[3].X = 36; va[3].Y = 37; va[3].Z = 38;
    }
}

bool Avx512MatVecMulF32Cpp(Vec4x1_F32* vec_b, float mat[4][4], Vec4x1_F32* vec_a, size_t num_vec)
{
    if (num_vec == 0 || num_vec % 4 != 0)
        return false;

    if (!AlignedMem::IsAligned(vec_a, 64) || !AlignedMem::IsAligned(vec_b, 64))
        return false;

    if (!AlignedMem::IsAligned(mat, 64))
        return false;

    for (size_t i = 0; i < num_vec; i++)
    {
        vec_b[i].W = mat[0][0] * vec_a[i].W + mat[0][1] * vec_a[i].X;
        vec_b[i].W += mat[0][2] * vec_a[i].Y + mat[0][3] * vec_a[i].Z;

        vec_b[i].X = mat[1][0] * vec_a[i].W + mat[1][1] * vec_a[i].X;
        vec_b[i].X += mat[1][2] * vec_a[i].Y + mat[1][3] * vec_a[i].Z;

        vec_b[i].Y = mat[2][0] * vec_a[i].W + mat[2][1] * vec_a[i].X;
        vec_b[i].Y += mat[2][2] * vec_a[i].Y + mat[2][3] * vec_a[i].Z;

        vec_b[i].Z = mat[3][0] * vec_a[i].W + mat[3][1] * vec_a[i].X;
        vec_b[i].Z += mat[3][2] * vec_a[i].Y + mat[3][3] * vec_a[i].Z;
    }

    return true;
}

void Avx512MatVecMulF32(void)
{
    const size_t num_vec = 8;

    alignas(64) float mat[4][4]
    {
        10.0, 11.0, 12.0, 13.0,
        20.0, 21.0, 22.0, 23.0,
        30.0, 31.0, 32.0, 33.0,
        40.0, 41.0, 42.0, 43.0
    };

    AlignedArray<Vec4x1_F32> vec_a_aa(num_vec, 64);
    AlignedArray<Vec4x1_F32> vec_b1_aa(num_vec, 64);
    AlignedArray<Vec4x1_F32> vec_b2_aa(num_vec, 64);
    Vec4x1_F32* vec_a = vec_a_aa.Data();
    Vec4x1_F32* vec_b1 = vec_b1_aa.Data();
    Vec4x1_F32* vec_b2 = vec_b2_aa.Data();

    InitVecArray(vec_a, num_vec);

    bool rc1 = Avx512MatVecMulF32Cpp(vec_b1, mat, vec_a, num_vec);
    bool rc2 = Avx512MatVecMulF32_(vec_b2, mat, vec_a, num_vec);

    cout << "Results for Avx512MatVecMulF32\n";

    if (!rc1 || !rc2)
    {
        cout << "Invalid return code\n";
        cout << " rc1 = " << boolalpha << rc1 << '\n';
        cout << " rc2 = " << boolalpha << rc2 << '\n';
        return;
    }

    const unsigned int w = 8;
    cout << fixed << setprecision(1);

    for (size_t i = 0; i < num_vec; i++)
    {
        cout << "Test case #" << i << '\n';

        cout << "vec_b1: ";
        cout << " " << setw(w) << vec_b1[i].W << ' ';
        cout << " " << setw(w) << vec_b1[i].X << ' ';
        cout << " " << setw(w) << vec_b1[i].Y << ' ';
        cout << " " << setw(w) << vec_b1[i].Z << '\n';

        cout << "vec_b2: ";
        cout << " " << setw(w) << vec_b2[i].W << ' ';
        cout << " " << setw(w) << vec_b2[i].X << ' ';
        cout << " " << setw(w) << vec_b2[i].Y << ' ';
        cout << " " << setw(w) << vec_b2[i].Z << '\n';

        if (!VecCompare(&vec_b1[i], &vec_b2[i]))
        {
            cout << "Error - vector compare failed\n";
            return;
        }
    }
}

int main()
{
    Avx512MatVecMulF32();
    Avx512MatVecMulF32_BM();
    return 0;
}

;-------------------------------------------------
;               Ch13_08.asm
;-------------------------------------------------

ConstVals segment readonly align(64) 'const'

; Indices for matrix permutations
MatPerm0    dword 0, 4, 8, 12, 0, 4, 8, 12, 0, 4, 8, 12, 0, 4, 8, 12
MatPerm1    dword 1, 5, 9, 13, 1, 5, 9, 13, 1, 5, 9, 13, 1, 5, 9, 13
MatPerm2    dword 2, 6, 10, 14, 2, 6, 10, 14, 2, 6, 10, 14, 2, 6, 10, 14
MatPerm3    dword 3, 7, 11, 15, 3, 7, 11, 15, 3, 7, 11, 15, 3, 7, 11, 15

; Indices for vector permutations
VecPerm0    dword 0, 0, 0, 0, 4, 4, 4, 4, 8, 8, 8, 8, 12, 12, 12, 12
VecPerm1    dword 1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9, 13, 13, 13, 13
VecPerm2    dword 2, 2, 2, 2, 6, 6, 6, 6, 10, 10, 10, 10, 14, 14, 14, 14
VecPerm3    dword 3, 3, 3, 3, 7, 7, 7, 7, 11, 11, 11, 11, 15, 15, 15, 15
ConstVals ends

; extern "C" bool Avx512MatVecMulF32_(Vec4x1_F32* vec_b, float mat[4][4], Vec4x1_F32* vec_a, size_t num_vec);

        .code
Avx512MatVecMulF32_ proc
        xor eax,eax                         ;set error code (also i = 0)
        test r9,r9
        jz Done                             ;jump if num_vec is zero
        test r9,3
        jnz Done                            ;jump if n % 4 != 0

        test rcx,3fh
        jnz Done                            ;jump if vec_b is not properly aligned
        test rdx,3fh
        jnz Done                            ;jump if mat is not properly aligned
        test r8,3fh
        jnz Done                            ;jump if vec_a is not properly aligned

; Load permutation indices for matrix columns and vector elements
        vmovdqa32 zmm16,zmmword ptr [MatPerm0]  ;mat col 0 indices
        vmovdqa32 zmm17,zmmword ptr [MatPerm1]  ;mat col 1 indices
        vmovdqa32 zmm18,zmmword ptr [MatPerm2]  ;mat col 2 indices
        vmovdqa32 zmm19,zmmword ptr [MatPerm3]  ;mat col 3 indices

        vmovdqa32 zmm24,zmmword ptr [VecPerm0]  ;W component indices
        vmovdqa32 zmm25,zmmword ptr [VecPerm1]  ;X component indices
        vmovdqa32 zmm26,zmmword ptr [VecPerm2]  ;Y component indices
        vmovdqa32 zmm27,zmmword ptr [VecPerm3]  ;Z component indices

; Load source matrix and duplicate columns
        vmovaps zmm0,zmmword ptr [rdx]      ;zmm0 = mat

        vpermps zmm20,zmm16,zmm0            ;zmm20 = mat col 0 (4x)
        vpermps zmm21,zmm17,zmm0            ;zmm21 = mat col 1 (4x)
        vpermps zmm22,zmm18,zmm0            ;zmm22 = mat col 2 (4x)
        vpermps zmm23,zmm19,zmm0            ;zmm23 = mat col 3 (4x)

; Load the next 4 vectors
        align 16
@@:     vmovaps zmm4,zmmword ptr [r8+rax]   ;zmm4 = vec_a (4 vectors)

; Permute the vector elements for subsequent calculations
        vpermps zmm0,zmm24,zmm4             ;zmm0 = vec_a W components
        vpermps zmm1,zmm25,zmm4             ;zmm1 = vec_a X components
        vpermps zmm2,zmm26,zmm4             ;zmm2 = vec_a Y components
        vpermps zmm3,zmm27,zmm4             ;zmm3 = vec_a Z components

; Perform matrix-vector multiplications (4 vectors)
        vmulps zmm28,zmm20,zmm0
        vmulps zmm29,zmm21,zmm1
        vmulps zmm30,zmm22,zmm2
        vmulps zmm31,zmm23,zmm3
        vaddps zmm4,zmm28,zmm29
        vaddps zmm5,zmm30,zmm31
        vaddps zmm4,zmm4,zmm5               ;zmm4 = vec_b (4 vectors)

        vmovaps zmmword ptr [rcx+rax],zmm4  ;save result

        add rax,64                          ;rax = offset to next block of 4 vectors
        sub r9,4
        jnz @B                              ;repeat until done

        mov eax,1                           ;set success code

Done:   vzeroupper
        ret
Avx512MatVecMulF32_ endp
        end

The C++ code in Listing 13-8 begins with header file Ch13_08.h that contains the requisite function
declarations. This file also includes a declaration for the structure Vec4x1_F32, which incorporates the
four components of a 4 x 1 column vector. The source code file Ch13_08.cpp includes a function named
Avx512MatVecMulF32Cpp. This function implements the matrix-vector multiplication equations that are
shown in Figure 13-5. The remaining C++ code in Listing 13-8 performs test case initializations, invokes the
calculating functions, and displays the results.

The assembly language code in Listing 13-8 starts with a constant data segment that defines a series of 
packed permutation indices. The assembly language implementation of the matrix-vector multiplication 
algorithm uses these values to reorder the elements of the source matrix and vectors. The reason for 
this reordering is to facilitate the simultaneous calculation of four matrix-vector products. The function 
Avx512MatVecMulF32_ begins its execution by verifying that num_vec is nonzero and evenly divisible by four. It then checks the
matrix and vector buffer pointers for proper alignment on a 64-byte boundary. 
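
The argument checks can be expressed in C++ as shown in the following sketch. It assumes only what the assembly language code tests (a nonzero count that is a multiple of four, and three pointers whose low-order six address bits are zero); the helper names IsAlignedTo and ValidateArgs are hypothetical and are not part of the example's source code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p sits on the specified power-of-two boundary. For a 64-byte
// boundary this tests the low six address bits, which is what
// "test reg,3fh / jnz Done" does in the assembly language code.
bool IsAlignedTo(const void* p, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Mirrors the validation sequence at the top of the function: num_vec must
// be nonzero and a multiple of four, and all three buffers must be aligned
// on a 64-byte boundary.
bool ValidateArgs(const void* vec_b, const void* mat, const void* vec_a, std::size_t num_vec)
{
    if (num_vec == 0 || (num_vec & 3) != 0)
        return false;

    return IsAlignedTo(vec_b, 64) && IsAlignedTo(mat, 64) && IsAlignedTo(vec_a, 64);
}
```

The alignment requirement matters because the function uses vmovaps for its 512-bit loads and stores, and that instruction faults if its memory operand is not aligned on a 64-byte boundary.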

Following argument validation, four vmovdqa32 instructions load the matrix permutation indices into 
registers ZMM16-ZMM19. This is followed by another series of four vmovdqa32 instructions that load the 
vector permutation indices into registers ZMM24-ZMM27. The ensuing vmovaps zmm0,zmmword ptr [rdx]
instruction loads all 16 single-precision floating-point elements of matrix mat into ZMM0. The vpermps
zmm20,zmm16,zmm0 instruction (Permute Single-Precision Floating-Point Elements) rearranges the elements
in ZMM0 according to the indices in ZMM16. Execution of this instruction loads four copies of column 0 from
matrix mat into register ZMM20. Three more vpermps instructions are then employed to perform the same 
operation using columns 1, 2, and 3. Figure 13-6 elucidates the execution of these permutations in greater detail. 
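
The effect of these permutations can be modeled with a short scalar C++ sketch. The type aliases and the function names Permute16 and MakeMatPerm are assumptions introduced for this illustration; the model treats a ZMM register as an array of 16 single-precision values, with element 0 in the low-order position.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Zmm = std::array<float, 16>;              // one 512-bit register of floats
using ZmmIdx = std::array<std::uint32_t, 16>;   // one 512-bit register of indices

// Scalar model of vpermps dst,idx,src: dst[i] = src[idx[i] mod 16].
Zmm Permute16(const ZmmIdx& idx, const Zmm& src)
{
    Zmm dst{};
    for (std::size_t i = 0; i < 16; i++)
        dst[i] = src[idx[i] & 0xF];
    return dst;
}

// Index pattern that selects column col of a row-major 4 x 4 matrix four
// times. For col = 0 this yields 0, 4, 8, 12 repeated, the same values as
// MatPerm0 in the constant segment.
ZmmIdx MakeMatPerm(std::uint32_t col)
{
    assert(col < 4);

    ZmmIdx idx{};
    for (std::size_t i = 0; i < 16; i++)
        idx[i] = static_cast<std::uint32_t>((i % 4) * 4) + col;
    return idx;
}
```

Applying Permute16 with the column 0 indices to the test matrix from the C++ code produces 10, 20, 30, 40 repeated four times, which corresponds to the content of ZMM20 after the first vpermps instruction.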


[Figure 13-6 diagram: matrix mat in zmm0, the matrix permutation indices in zmm16-zmm19, and the results of
vpermps zmm20,zmm16,zmm0 through vpermps zmm23,zmm19,zmm0 (zmm20-zmm23 = mat columns 0-3, each repeated 4x)]


Figure 13-6. Permutation of matrix columns using vpermps instructions

The processing loop in Avx512MatVecMulF32_ starts with a vmovaps zmm4,zmmword ptr [r8+rax]
instruction that loads four Vec4x1_F32 vectors into register ZMM4. The W, X, Y, and Z components of
these vectors are then regrouped using another series of vpermps instructions. Following execution of
these instructions, registers ZMM0-ZMM3 contain repeated sets of the vector components, as shown in
Figure 13-7.


[Figure 13-7 diagram: four Vec4x1_F32 vectors a[0]-a[3] in zmm4, the vector permutation indices in zmm24-zmm27,
and the results of vpermps zmm0,zmm24,zmm4 through vpermps zmm3,zmm27,zmm4 (zmm0-zmm3 = vec_a W, X, Y, and Z
components, each repeated 4x)]


Figure 13-7. Permutation of vector components using vpermps instructions 


Following the vector component permutations, a series of vmulps and vaddps instructions carry out
four simultaneous matrix-vector multiplications. Figure 13-8 illustrates this operation in greater detail. The
ensuing vmovaps zmmword ptr [rcx+rax],zmm4 instruction saves the four resultant 4 x 1 vectors in the
vec_b array. The processing loop then repeats until all vectors in vec_a have been processed.

[Figure 13-8 diagram: matrix values (zmm20-zmm23) and vector components (zmm0-zmm3) are multiplied by four
vmulps instructions (results in zmm28-zmm31) and summed by three vaddps instructions, leaving vec_b
(4 vectors, b[0].W through b[3].Z) in zmm4]


Figure 13-8. Matrix-vector multiplications using vmulps and vaddps 
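
The data arrangement used by the processing loop can also be traced with a scalar C++ model. The sketch below is an illustration only, not the book's implementation: the function name MatVecMul4 is hypothetical, and each of the 16 result elements is computed one at a time instead of with packed instructions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Vec4x1_F32 { float W, X, Y, Z; };

// Multiplies four 4 x 1 vectors by one row-major 4 x 4 matrix using the same
// element arrangement as the processing loop: result element j (0-15)
// belongs to vector j / 4 and matrix row j % 4.
void MatVecMul4(Vec4x1_F32 b[4], const float mat[4][4], const Vec4x1_F32 a[4])
{
    assert(b != nullptr && mat != nullptr && a != nullptr);

    std::array<float, 16> sum{};

    for (std::size_t j = 0; j < 16; j++)
    {
        const std::size_t v = j / 4;    // which vector
        const std::size_t r = j % 4;    // which matrix row (output component)

        // Matrix column c (repeated 4x) times component c of each vector
        // (repeated 4x), which models the four vmulps instructions.
        const float p0 = mat[r][0] * a[v].W;
        const float p1 = mat[r][1] * a[v].X;
        const float p2 = mat[r][2] * a[v].Y;
        const float p3 = mat[r][3] * a[v].Z;

        // Same summation order as the three vaddps instructions.
        sum[j] = (p0 + p1) + (p2 + p3);
    }

    for (std::size_t v = 0; v < 4; v++)
        b[v] = Vec4x1_F32{ sum[4 * v], sum[4 * v + 1], sum[4 * v + 2], sum[4 * v + 3] };
}
```

Using the test matrix and the first four test vectors from InitVecArray, this model yields 304, 564, 824, and 1084 for the first vector, which agrees with the sample calculation in Figure 13-5.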


The output for source code example Ch13_08 follows this paragraph. Table 13-2 shows benchmark
timing measurements for the C++ and assembly language matrix-vector multiplication functions.

Results for Avx512MatVecMulF32
Test case #0
vec_b1: 304.0 564.0 824.0 1084.0
vec_b2: 304.0 564.0 824.0 1084.0
Test case #1
vec_b1: 764.0 1424.0 2084.0 2744.0
vec_b2: 764.0 1424.0 2084.0 2744.0
Test case #2
vec_b1: 1224.0 2284.0 3344.0 4404.0
vec_b2: 1224.0 2284.0 3344.0 4404.0
Test case #3
vec_b1: 1684.0 3144.0 4604.0 6064.0
vec_b2: 1684.0 3144.0 4604.0 6064.0
Test case #4
vec_b1: 11932.0 22452.0 32972.0 43492.0
vec_b2: 11932.0 22452.0 32972.0 43492.0
Test case #5
vec_b1: 17125.0 31705.0 46285.0 60865.0
vec_b2: 17125.0 31705.0 46285.0 60865.0
Test case #6
vec_b1: 12723.0 23873.0 35023.0 46173.0
vec_b2: 12723.0 23873.0 35023.0 46173.0
Test case #7
vec_b1: 15121.0 27871.0 40621.0 53371.0
vec_b2: 15121.0 27871.0 40621.0 53371.0

Running benchmark function Avx512MatVecMulF32_BM - please wait
Benchmark times save to file Ch13_08_Avx512MatVecMulF32_BM_CHROMIUM.csv


Table 13-2. Benchmark Timing Measurements for Matrix-Vector Multiplication Functions (1,000,000 Vectors)

CPU         Avx512MatVecMulF32Cpp   Avx512MatVecMulF32_
i7-4790S    ----                    ----
i9-7900X    6174                    1778
i7-8700K    ----                    ----


Convolutions


Listing 13-9 shows the source code for example Ch13_09. This example is an AVX-512 implementation of
the convolution program that was presented in source code example Ch11_02. The primary purpose of this
example is to highlight the conversion of functions that use AVX2 instructions to ones that exploit AVX-512
instructions. It also provides an opportunity to compare benchmark timing measurements between the
AVX2 and AVX-512 implementations of the convolution functions.

Listing 13-9. Example Ch13_09

        include <MacrosX86-64-AVX.asmh>
        extern c_NumPtsMin:dword
        extern c_NumPtsMax:dword
        extern c_KernelSizeMin:dword
        extern c_KernelSizeMax:dword

; extern bool Avx512Convolve2_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)

        .code
Avx512Convolve2_ proc frame
        _CreateFrame CV2_,0,0,rbx
        _EndProlog

; Validate argument values
        xor eax,eax                         ;set error code

        mov r10d,dword ptr [rbp+CV2_OffsetStackArgs]
        test r10d,1
        jz Done                             ;kernel_size is even
        cmp r10d,[c_KernelSizeMin]
        jl Done                             ;kernel_size too small
        cmp r10d,[c_KernelSizeMax]
        jg Done                             ;kernel_size too big

        cmp r8d,[c_NumPtsMin]
        jl Done                             ;num_pts too small
        cmp r8d,[c_NumPtsMax]
        jg Done                             ;num_pts too big
        test r8d,15
        jnz Done                            ;num_pts not even multiple of 16

        test rcx,3fh
        jnz Done                            ;y is not properly aligned

; Initialize convolution loop variables
        shr r10d,1                          ;r10 = kernel_size / 2 (ks2)
        lea rdx,[rdx+r10*4]                 ;rdx = x + ks2 (first data point)
        xor ebx,ebx                         ;i = 0

; Perform convolution
LP1:    vxorps zmm0,zmm0,zmm0               ;packed sum = 0.0;
        mov r11,r10                         ;r11 = ks2
        neg r11                             ;k = -ks2

LP2:    mov rax,rbx                         ;rax = i
        sub rax,r11                         ;rax = i - k
        vmovups zmm1,zmmword ptr [rdx+rax*4]    ;load x[i - k]:x[i - k + 15]

        mov rax,r11
        add rax,r10                         ;rax = k + ks2
        vbroadcastss zmm2,real4 ptr [r9+rax*4]  ;zmm2 = kernel[k + ks2]
        vfmadd231ps zmm0,zmm1,zmm2          ;zmm0 += x[i-k]:x[i-k+15] * kernel[k+ks2]

        add r11,1                           ;k += 1
        cmp r11,r10
        jle LP2                             ;repeat until k > ks2

        vmovaps zmmword ptr [rcx+rbx*4],zmm0    ;save y[i]:y[i + 15]

        add rbx,16                          ;i += 16
        cmp rbx,r8
        jl LP1                              ;repeat until done

        mov eax,1                           ;set success return code

Done:   vzeroupper
        _DeleteFrame rbx
        ret
Avx512Convolve2_ endp

; extern bool Avx512Convolve2Ks5_(float* y, const float* x, int num_pts, const float* kernel, int kernel_size)

Avx512Convolve2Ks5_ proc frame
        _CreateFrame CKS5_,0,48
        _SaveXmmRegs xmm6,xmm7,xmm8
        _EndProlog

; Validate argument values
        xor eax,eax                         ;set error code (rax is also loop index var)

        cmp dword ptr [rbp+CKS5_OffsetStackArgs],5
        jne Done                            ;jump if kernel_size is not 5

        cmp r8d,[c_NumPtsMin]
        jl Done                             ;jump if num_pts too small
        cmp r8d,[c_NumPtsMax]
        jg Done                             ;jump if num_pts too big
        test r8d,15
        jnz Done                            ;num_pts not even multiple of 16

        test rcx,3fh
        jnz Done                            ;y is not properly aligned



; Perform required initializations
        vbroadcastss zmm4,real4 ptr [r9]    ;kernel[0]
        vbroadcastss zmm5,real4 ptr [r9+4]  ;kernel[1]
        vbroadcastss zmm6,real4 ptr [r9+8]  ;kernel[2]
        vbroadcastss zmm7,real4 ptr [r9+12] ;kernel[3]
        vbroadcastss zmm8,real4 ptr [r9+16] ;kernel[4]
        mov r8d,r8d                         ;r8 = num_pts
        add rdx,8                           ;x += 2

; Perform convolution
@@:     vxorps zmm2,zmm2,zmm2               ;initialize sum vars
        vxorps zmm3,zmm3,zmm3

        mov r11,rax
        add r11,2                           ;j = i + ks2

        vmovups zmm0,zmmword ptr [rdx+r11*4]    ;zmm0 = x[j]:x[j + 15]
        vfmadd231ps zmm2,zmm0,zmm4          ;zmm2 += x[j]:x[j + 15] * kernel[0]

        vmovups zmm1,zmmword ptr [rdx+r11*4-4]  ;zmm1 = x[j - 1]:x[j + 14]
        vfmadd231ps zmm3,zmm1,zmm5          ;zmm3 += x[j - 1]:x[j + 14] * kernel[1]

        vmovups zmm0,zmmword ptr [rdx+r11*4-8]  ;zmm0 = x[j - 2]:x[j + 13]
        vfmadd231ps zmm2,zmm0,zmm6          ;zmm2 += x[j - 2]:x[j + 13] * kernel[2]

        vmovups zmm1,zmmword ptr [rdx+r11*4-12] ;zmm1 = x[j - 3]:x[j + 12]
        vfmadd231ps zmm3,zmm1,zmm7          ;zmm3 += x[j - 3]:x[j + 12] * kernel[3]

        vmovups zmm0,zmmword ptr [rdx+r11*4-16] ;zmm0 = x[j - 4]:x[j + 11]
        vfmadd231ps zmm2,zmm0,zmm8          ;zmm2 += x[j - 4]:x[j + 11] * kernel[4]

        vaddps zmm0,zmm2,zmm3               ;final values
        vmovaps zmmword ptr [rcx+rax*4],zmm0    ;save y[i]:y[i + 15]

        add rax,16                          ;i += 16
        cmp rax,r8
        jl @B                               ;jump if i < num_pts

        mov eax,1                           ;set success return code

Done:   vzeroupper
        _RestoreXmmRegs xmm6,xmm7,xmm8
        _DeleteFrame
        ret
Avx512Convolve2Ks5_ endp
        end


The C++ portion of source code example Ch13_09 is not shown in Listing 13-9 since it's almost identical
to the C++ code in example Ch11_02. Modifications made in the Ch13_09 C++ code include a few function
name changes. The test arrays are also allocated on a 64-byte instead of a 32-byte boundary.

The assembly language function Avx512Convolve2_ implements the variable-size kernel convolution
algorithm that's described in Chapter 11. The primary difference between this function and its AVX2
counterpart Convolve2_ (see Listing 11-2) is the use of ZMM registers instead of YMM registers. The code
that adjusts the index counter in register RBX was also modified to reflect the processing of 16 data points
per iteration instead of 8. Similar changes were also made to the fixed-size kernel convolution function
Avx512Convolve2Ks5_.
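
For reference, the arithmetic performed by the two nested loops in Avx512Convolve2_ can be written as the following scalar C++ sketch. It is not the Ch13_09 C++ code (which is not shown); the function name ConvolveRef is hypothetical, and the sketch assumes that the input buffer holds num_pts + kernel_size - 1 elements so that every x[i - k] access stays in bounds. Unlike the assembly language function, it does not require num_pts to be a multiple of 16 and does not range-check its arguments against the c_ limit values.

```cpp
#include <cassert>

// Scalar model of the variable-size kernel convolution loops. x_padded is
// assumed to contain ks2 extra points before and after the num_pts points
// that are processed, where ks2 = kernel_size / 2.
bool ConvolveRef(float* y, const float* x_padded, int num_pts, const float* kernel, int kernel_size)
{
    assert(y != nullptr && x_padded != nullptr && kernel != nullptr);

    if (num_pts <= 0 || kernel_size <= 0 || (kernel_size & 1) == 0)
        return false;                       // kernel size must be odd

    const int ks2 = kernel_size / 2;
    const float* x = x_padded + ks2;        // first data point (lea rdx,[rdx+r10*4])

    for (int i = 0; i < num_pts; i++)
    {
        float sum = 0.0f;

        for (int k = -ks2; k <= ks2; k++)
            sum += x[i - k] * kernel[k + ks2];

        y[i] = sum;
    }

    return true;
}
```

The assembly language version evaluates the same sum for 16 consecutive values of i at a time and uses fused multiply-add instructions, so its results can differ from this model in the low-order bits for arbitrary floating-point data.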

The output for source code example Ch13_09 is not shown since it's the same as the output for
source code example Ch11_02. Table 13-3 shows the benchmark timing measurements for functions
Avx512Convolve2_ and Avx512Convolve2Ks5_. This table also includes the benchmark timing
measurements for the AVX2 functions Convolve2_ and Convolve2Ks5_ from Table 11-2. On the i9-7900X, the
AVX-512 implementations are faster than their AVX2 counterparts, especially for the size-independent
convolution function Avx512Convolve2_ (757 versus 956 microseconds, or about 21% less time, compared with
693 versus 719 microseconds, or about 4% less time, for the fixed-size kernel function). It would, of course,
be imprudent to extrapolate any general conclusions regarding AVX-512 versus AVX2 performance based solely
on the timing measurements shown in Table 13-3. You'll see other examples in Chapter 14.


Table 13-3. Mean Execution Times (Microseconds) for AVX2 and AVX-512 Convolution Functions Using
Five-Element Convolution Kernel (2,000,000 Signal Points)

CPU         Convolve2_   Avx512Convolve2_   Convolve2Ks5_   Avx512Convolve2Ks5_
i7-4790S    1244         ----               1067            ----
i9-7900X    956          757                719             693
i7-8700K    859          ----               595             ----


Summary 


Here are the key learning points for Chapter 13.

•   When using merge masking with scalar or packed operands, the processor carries
    out the instruction's calculation only if the corresponding opmask register bit is set
    to one. Otherwise, no calculation is performed and the destination operand element
    remains unchanged.

•   AVX-512 assembly language functions can use an opmask register destination
    operand with most instructions that perform scalar or packed compare operations.
    The bits of the opmask register can then be employed to effect data-driven logic
    decisions without any conditional jump instructions using either merge or zero masking
    and (if necessary) simple Boolean operations.

•   AVX-512 assembly language functions must use the vmovdqu[32|64] and
    vmovdqa[32|64] instructions to perform move operations using 512-bit wide packed
    doubleword and quadword integer operands. These instructions can also be used
    with 256-bit and 128-bit wide operands.

•   Unlike AVX and AVX2, AVX-512 includes instructions that perform conversions
    between floating-point and unsigned integer operands.

•   AVX-512 functions should ensure that packed 128-, 256-, and 512-bit wide operands
    are aligned on a proper boundary whenever possible.

•   Assembly language functions that use AVX-512 instructions with ZMM0-ZMM15
    or YMM0-YMM15 register operands should always use a vzeroupper
    instruction before program control is transferred back to the calling function.

•   Assembly language functions and algorithms that employ a structure of arrays are
    often faster than those that use an array of structures.

•   The Visual C++ calling convention treats AVX-512 registers ZMM16-ZMM31,
    YMM16-YMM31, and XMM16-XMM31 as volatile across function boundaries. This
    means that a function can use these registers without needing to preserve their
    values.
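
The difference between merge masking and zero masking that's mentioned in the first two learning points can be summarized with a scalar C++ model of a masked packed add. This is a sketch for illustration only; the alias Vec8 and the function name MaskedAdd are hypothetical, and the model covers eight double-precision elements.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec8 = std::array<double, 8>;

// Scalar model of a masked packed add with eight elements. With merge
// masking, elements whose mask bit is zero keep the old destination value;
// with zero masking, those elements are set to 0.0.
Vec8 MaskedAdd(const Vec8& dst_old, const Vec8& a, const Vec8& b, unsigned mask, bool zero_masking)
{
    assert(mask <= 0xFFu);      // only eight mask bits are meaningful here

    Vec8 dst = dst_old;

    for (std::size_t i = 0; i < 8; i++)
    {
        if (mask & (1u << i))
            dst[i] = a[i] + b[i];       // mask bit set: perform the calculation
        else if (zero_masking)
            dst[i] = 0.0;               // zero masking: clear the element
        // merge masking: dst[i] is left unchanged
    }

    return dst;
}
```

In assembly language, the mask itself is typically produced by a packed compare instruction, which is how data-driven decisions are made without conditional jumps.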


CHAPTER 14 


AVX-512 Programming - Packed Integers


In Chapters 7 and 10, you learned how to use the AVX and AVX2 instruction sets to perform packed
integer operations using 128-bit and 256-bit wide operands. In this chapter, you learn how to use the AVX-512
instruction set to carry out packed integer operations using 512-bit wide operands. You also learn how
to use AVX-512 instructions with 256-bit and 128-bit wide packed integer operands. The first source code
example explains how to perform basic packed integer arithmetic using ZMM registers. This is followed by
several examples that illustrate image-processing algorithms and techniques using AVX-512 instructions.
Like the previous chapter, all of the source code examples in this chapter require a processor and operating
system that support AVX-512 and the following instruction set extensions: AVX512F, AVX512CD, AVX512BW,
AVX512DQ, and AVX512VL. You can use one of the freely available utilities listed in Appendix A to determine
whether your system supports these extensions.


Basic Arithmetic 


Listing 14-1 shows the source code for example Ch14_01. This example demonstrates how to perform basic
packed integer arithmetic using 512-bit wide operands and the ZMM register set. 
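
Two of the instructions exercised by this example, vpaddw and vpaddsw, differ only in how they handle results that do not fit in a 16-bit signed integer. The following C++ sketch models both behaviors for a single element; the function names AddWrapI16 and AddSatI16 are hypothetical and are not part of the example's source code.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Wraparound 16-bit addition (the per-element behavior of vpaddw). The sum is
// formed modulo 2^16 and then reinterpreted as a signed value; the final cast
// is modular on mainstream compilers and is guaranteed to be so in C++20.
std::int16_t AddWrapI16(std::int16_t a, std::int16_t b)
{
    const std::uint16_t sum = static_cast<std::uint16_t>(
        static_cast<std::uint16_t>(a) + static_cast<std::uint16_t>(b));
    return static_cast<std::int16_t>(sum);
}

// Signed saturating 16-bit addition (the per-element behavior of vpaddsw).
// Results outside [-32768, 32767] are clamped to the nearest limit.
std::int16_t AddSatI16(std::int16_t a, std::int16_t b)
{
    const int lo = std::numeric_limits<std::int16_t>::min();
    const int hi = std::numeric_limits<std::int16_t>::max();
    const int sum = static_cast<int>(a) + static_cast<int>(b);

    assert(sum >= 2 * lo && sum <= 2 * hi);     // an int sum of two int16 values cannot overflow
    return static_cast<std::int16_t>(sum < lo ? lo : (sum > hi ? hi : sum));
}
```

For example, adding 30000 and 3000 yields -32536 with wraparound arithmetic and 32767 with saturated arithmetic. The subtraction instructions vpsubw and vpsubsw follow the same pattern.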


Listing 14-1. Example Ch14_01


#include "stdafx.h"
#include <cstdint>
#include <iostream>
#include <iomanip>
#include "ZmmVal.h"

using namespace std;

extern "C" void Avx512PackedMathI16_(const ZmmVal* a, const ZmmVal* b, ZmmVal c[6]);
extern "C" void Avx512PackedMathI64_(const ZmmVal* a, const ZmmVal* b, ZmmVal c[6], uint32_t opmask);


© Daniel Kusswurm 2018
D. Kusswurm, Modern X86 Assembly Language Programming,
https://doi.org/10.1007/978-1-4842-4063-2_14


void Avx512PackedMathI16(void)
{
    alignas(64) ZmmVal a;
    alignas(64) ZmmVal b;
    alignas(64) ZmmVal c[6];

    a.m_I16[0] = 10;        b.m_I16[0] = 100;
    a.m_I16[1] = 20;        b.m_I16[1] = 200;
    a.m_I16[2] = 30;        b.m_I16[2] = 300;
    a.m_I16[3] = 40;        b.m_I16[3] = 400;
    a.m_I16[4] = 50;        b.m_I16[4] = 500;
    a.m_I16[5] = 60;        b.m_I16[5] = 600;
    a.m_I16[6] = 70;        b.m_I16[6] = 700;
    a.m_I16[7] = 80;        b.m_I16[7] = 800;

    a.m_I16[8] = 1000;      b.m_I16[8] = -100;
    a.m_I16[9] = 2000;      b.m_I16[9] = 200;
    a.m_I16[10] = 3000;     b.m_I16[10] = -300;
    a.m_I16[11] = 4000;     b.m_I16[11] = 400;
    a.m_I16[12] = 5000;     b.m_I16[12] = -500;
    a.m_I16[13] = 6000;     b.m_I16[13] = 600;
    a.m_I16[14] = 7000;     b.m_I16[14] = -700;
    a.m_I16[15] = 8000;     b.m_I16[15] = 800;

    a.m_I16[16] = -1000;    b.m_I16[16] = 100;
    a.m_I16[17] = -2000;    b.m_I16[17] = -200;
    a.m_I16[18] = 3000;     b.m_I16[18] = 303;
    a.m_I16[19] = 4000;     b.m_I16[19] = -400;
    a.m_I16[20] = -5000;    b.m_I16[20] = 500;
    a.m_I16[21] = -6000;    b.m_I16[21] = -600;
    a.m_I16[22] = -7000;    b.m_I16[22] = 700;
    a.m_I16[23] = -8000;    b.m_I16[23] = 800;

    a.m_I16[24] = 30000;    b.m_I16[24] = 3000;     // add overflow
    a.m_I16[25] = 6000;     b.m_I16[25] = 32000;    // add overflow
    a.m_I16[26] = -25000;   b.m_I16[26] = -27000;   // add overflow
    a.m_I16[27] = 8000;     b.m_I16[27] = 28700;    // add overflow
    a.m_I16[28] = 2000;     b.m_I16[28] = -31000;   // sub overflow
    a.m_I16[29] = 4000;     b.m_I16[29] = -30000;   // sub overflow
    a.m_I16[30] = -3000;    b.m_I16[30] = 32000;    // sub overflow
    a.m_I16[31] = -15000;   b.m_I16[31] = 24000;    // sub overflow

    Avx512PackedMathI16_(&a, &b, c);

    cout << "\nResults for Avx512PackedMathI16\n\n";
    cout << " i        a        b   vpaddw  vpaddsw   vpsubw  vpsubsw  vpminsw  vpmaxsw\n";
    cout << "--------------------------------------------------------------------------\n";

    for (int i = 0; i < 32; i++)
    {
        cout << setw(2) << i << ' ';
        cout << setw(8) << a.m_I16[i] << ' ';
        cout << setw(8) << b.m_I16[i] << ' ';
        cout << setw(8) << c[0].m_I16[i] << ' ';
        cout << setw(8) << c[1].m_I16[i] << ' ';
        cout << setw(8) << c[2].m_I16[i] << ' ';
        cout << setw(8) << c[3].m_I16[i] << ' ';
        cout << setw(8) << c[4].m_I16[i] << ' ';
        cout << setw(8) << c[5].m_I16[i] << '\n';
    }
}

void Avx512PackedMathI64(void)
{
    alignas(64) ZmmVal a;
    alignas(64) ZmmVal b;
    alignas(64) ZmmVal c[6];
    uint32_t opmask = 0x7f;

    a.m_I64[0] = 64;        b.m_I64[0] =
    a.m_I64[1] = 1024;      b.m_I64[1] =
    a.m_I64[2] = -2048;     b.m_I64[2] =
    a.m_I64[3] = 8192;      b.m_I64[3] =
    a.m_I64[4] = -256;      b.m_I64[4] =
    a.m_I64[5] = 4096;      b.m_I64[5] =
    a.m_I64[6] = 16;        b.m_I64[6] =
    a.m_I64[7] = 512;       b.m_I64[7] =

    Avx512PackedMathI64_(&a, &b, c, opmask);

    cout << "\nResults for Avx512PackedMathI64\n\n";
    cout << "op_mask = " << hex << opmask << dec << '\n';
    cout << " i      a      b   vpaddq   vpsubq  vpmullq  vpsllvq  vpsravq   vpabsq\n";
    cout << "----------------------------------------------------------------------\n";

    for (int i = 0; i < 8; i++)
    {
        cout << setw(2) << i << ' ';
        cout << setw(6) << a.m_I64[i] << ' ';
        cout << setw(6) << b.m_I64[i] << ' ';
        cout << setw(8) << c[0].m_I64[i] << ' ';
        cout << setw(8) << c[1].m_I64[i] << ' ';
        cout << setw(8) << c[2].m_I64[i] << ' ';
        cout << setw(8) << c[3].m_I64[i] << ' ';
        cout << setw(8) << c[4].m_I64[i] << ' ';
        cout << setw(8) << c[5].m_I64[i] << '\n';
    }
}

int main()
{
    Avx512PackedMathI16();
    Avx512PackedMathI64();
    return 0;
}

; extern "C" void Avx512PackedMathI16_(const ZmmVal* a, const ZmmVal* b, ZmmVal c[6])

        .code
Avx512PackedMathI16_ proc
        vmovdqu16 zmm0,zmmword ptr [rcx]        ;zmm0 = a
        vmovdqu16 zmm1,zmmword ptr [rdx]        ;zmm1 = b

; Perform packed word operations
        vpaddw zmm2,zmm0,zmm1                   ;add
        vmovdqa64 zmmword ptr [r8],zmm2         ;save vpaddw result
        vpaddsw zmm2,zmm0,zmm1                  ;add with signed saturation
        vmovdqa64 zmmword ptr [r8+64],zmm2      ;save vpaddsw result
        vpsubw zmm2,zmm0,zmm1                   ;sub
        vmovdqa64 zmmword ptr [r8+128],zmm2     ;save vpsubw result
        vpsubsw zmm2,zmm0,zmm1                  ;sub with signed saturation
        vmovdqa64 zmmword ptr [r8+192],zmm2     ;save vpsubsw result
        vpminsw zmm2,zmm0,zmm1                  ;signed minimums
        vmovdqa64 zmmword ptr [r8+256],zmm2     ;save vpminsw result
        vpmaxsw zmm2,zmm0,zmm1                  ;signed maximums
        vmovdqa64 zmmword ptr [r8+320],zmm2     ;save vpmaxsw result

        vzeroupper
        ret
Avx512PackedMathI16_ endp

; extern "C" void Avx512PackedMathI64_(const ZmmVal* a, const ZmmVal* b, ZmmVal c[6],
;   unsigned int opmask)

Avx512PackedMathI64_ proc
        vmovdqa64 zmm0,zmmword ptr [rcx]        ;zmm0 = a
        vmovdqa64 zmm1,zmmword ptr [rdx]        ;zmm1 = b

        and r9d,0ffh                            ;r9d = opmask value
        kmovb k1,r9d                            ;k1 = opmask

; Perform packed quadword operations
        vpaddq zmm2{k1}{z},zmm0,zmm1            ;add
        vmovdqa64 zmmword ptr [r8],zmm2         ;save vpaddq result
        vpsubq zmm2{k1}{z},zmm0,zmm1            ;sub
        vmovdqa64 zmmword ptr [r8+64],zmm2      ;save vpsubq result
        vpmullq zmm2{k1}{z},zmm0,zmm1           ;signed mul (low 64 bits)
        vmovdqa64 zmmword ptr [r8+128],zmm2     ;save vpmullq result
        vpsllvq zmm2{k1}{z},zmm0,zmm1           ;shift left logical
        vmovdqa64 zmmword ptr [r8+192],zmm2     ;save vpsllvq result
        vpsravq zmm2{k1}{z},zmm0,zmm1           ;shift right arithmetic
        vmovdqa64 zmmword ptr [r8+256],zmm2     ;save vpsravq result
        vpabsq zmm2{k1}{z},zmm0                 ;absolute value
        vmovdqa64 zmmword ptr [r8+320],zmm2     ;save vpabsq result

        vzeroupper
        ret
Avx512PackedMathI64_ endp
        end


The C++ functions Avx512PackedMathI16 and Avx512PackedMathI64 are the base routines that handle 
AVX-512 packed integer operations using word and quadword values. Each function begins its execution by 
initializing the applicable integer elements of two ZmmVal variables. Note that the C++ alignas(64) specifier 
is used with each ZmmVal. Following variable initialization, each base routine invokes its corresponding 
assembly language function: Avx512PackedMathI16_ or Avx512PackedMathI64_. The results are then 
streamed to cout. 
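
The effect of the alignas(64) specifier can be checked with a few lines of C++. The sketch below is illustrative only: ZmmValSketch is a hypothetical stand-in for ZmmVal (whose definition in ZmmVal.h is not shown here), and IsAlignedSketch and LocalIsAligned64 are helper names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for ZmmVal: 64 bytes viewed as packed integers.
union ZmmValSketch
{
    int16_t m_I16[32];
    int64_t m_I64[8];
};

// Returns true if p lies on an n-byte boundary (n must be a power of two).
inline bool IsAlignedSketch(const void* p, std::size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (n - 1)) == 0;
}

// Declares a local variable with alignas(64) and reports whether its
// address is a multiple of 64, which is what an aligned 512-bit move
// such as vmovdqa64 requires of its memory operand.
inline bool LocalIsAligned64()
{
    alignas(64) ZmmValSketch a;
    a.m_I64[0] = 0;
    return IsAlignedSketch(&a, 64);
}
```

Without the specifier, a local variable of this type is only guaranteed the natural alignment of its widest member, so an aligned 512-bit load or store of it could fault.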

The assembly language function Avx512PackedMathI16_ starts its execution with two vmovdqa64 
instructions that load ZmmVal variables a and b into registers ZMM0 and ZMM1, respectively. Somewhat 
surprisingly, AVX512BW does not include aligned move instructions for 512-bit wide packed byte 
and word operands. Another alternative here would be to use the vmovdqu16 instruction. Note that 
this latter instruction must be used in cases where merge or zero masking is required. AVX512BW 
also includes a vmovdqu8 instruction for 512-bit wide packed byte operands. Following operand value 
loading, Avx512PackedMathI16_ demonstrates the packed word instructions vpaddw, vpaddsw, vpsubw, 
vpsubsw, vpminsw, and vpmaxsw. Each 512-bit packed word result is then saved in the array c. Note that 
Avx512PackedMathI16_ uses a vzeroupper instruction prior to its ret instruction. 
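
The difference between wraparound addition (vpaddw) and signed saturated addition (vpaddsw) is easiest to see on a single word element. The following scalar C++ sketch models one element of each instruction; the function names AddWrap16 and AddSat16 are made up for this illustration and do not appear in the example's source code.

```cpp
#include <cstdint>

// Models one word element of vpaddw: the sum wraps modulo 2^16.
inline int16_t AddWrap16(int16_t a, int16_t b)
{
    int32_t sum = static_cast<int32_t>(a) + b;
    if (sum > INT16_MAX)
        sum -= 65536;
    else if (sum < INT16_MIN)
        sum += 65536;
    return static_cast<int16_t>(sum);
}

// Models one word element of vpaddsw: the sum is clamped to
// the range [-32768, 32767] instead of wrapping.
inline int16_t AddSat16(int16_t a, int16_t b)
{
    int32_t sum = static_cast<int32_t>(a) + b;
    if (sum > INT16_MAX)
        return INT16_MAX;
    if (sum < INT16_MIN)
        return INT16_MIN;
    return static_cast<int16_t>(sum);
}
```

Using the test values assigned to element 24 in Listing 14-1 (30000 and 3000), AddWrap16 yields -32536 while AddSat16 yields 32767. For elements whose sum fits in 16 bits, both functions return the same value.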

The assembly language function Avx512PackedMathI64_ exemplifies various arithmetic operations 
using 512-bit wide packed quadword instructions. Note that this function includes an argument value 
named opmask, which is employed to highlight packed quadword zero masking. Avx512PackedMathI64_ 
also uses a vzeroupper instruction prior to its ret instruction. Here are the results for source code example 
Ch14_01. 

Results for Avx512PackedMathI16

 i        a        b   vpaddw  vpaddsw
--------------------------------------
 0       10      100      110      110
 1       20      200      220      220
 2       30      300      330      330
 3       40      400      440      440
 4       50      500      550      550
 5       60      600      660      660
 6       70      700      770      770
 7       80      800      880      880
 8     1000     -100      900      900
 9     2000      200     2200     2200
10     3000     -300     2700     2700
11     4000      400     4400     4400
12     5000     -500     4500     4500
13     6000      600     6600     6600
14     7000     -700     6300     6300
15     8000      800     8800     8800
16    -1000      100     -900     -900
17    -2000     -200    -2200    -2200
18     3000      303     3303     3303
19     4000     -400     3600     3600
20    -5000      500    -4500    -4500
21    -6000     -600    -6600    -6600
22    -7000      700    -6300    -6300
23    -8000      800    -7200    -7200
24    30000     3000   -32536    32767
25     6000    32000   -27536    32767
26   -25000   -27000    13536   -32768
27     8000    28700   -28836    32767
28     2000   -31000   -29000   -29000
29     4000   -30000   -26000   -26000
30    -3000    32000    29000    29000
31   -15000    24000     9000     9000

Results for Avx512PackedMathI64
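
The zero-masking operand {k1}{z} used throughout Avx512PackedMathI64_ can be modeled in scalar C++ as shown below. This is a sketch only: AddZeroMaskedI64 is a hypothetical helper, not part of the example, and it models just the vpaddq case.

```cpp
#include <array>
#include <cstdint>

// Models vpaddq zmm2{k1}{z},zmm0,zmm1 for eight quadword elements.
// Bit i of opmask selects element i. With zero masking, every
// element whose opmask bit is 0 is set to zero in the result.
inline std::array<int64_t, 8> AddZeroMaskedI64(const std::array<int64_t, 8>& a,
    const std::array<int64_t, 8>& b, uint8_t opmask)
{
    std::array<int64_t, 8> c {};        // all elements start at zero

    for (int i = 0; i < 8; i++)
    {
        if (opmask & (1u << i))
            c[i] = a[i] + b[i];
    }

    return c;
}
```

With the opmask value 0x7f used in Listing 14-1, bits 0 through 6 are set and bit 7 is clear, so elements 0 through 6 receive their sums and element 7 is zero regardless of its source operands.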


Image Processing 


The source code examples in this section explicate image-processing algorithms and techniques using 
AVX-512 packed integer instructions. Most of the source code examples are updated versions of examples 
from earlier chapters that exploited AVX or AVX2 instructions. Besides exemplifying AVX-512 packed integer 
instruction usage, the source code examples that follow also accentuate alternative algorithmic approaches 
and instruction sequences that often result in improved performance. 


Pixel Conversions 


In Chapter 7, you learned how to use the AVX instruction set to convert unsigned 8-bit pixels to 
single-precision floating-point pixels and vice versa (see example Ch07_06). Source code example Ch14_02 
demonstrates how to carry out these same conversions using AVX-512 instructions. Listing 14-2 shows the 
source code for example Ch14_02. 


Listing 14-2. Example Ch14_02 


#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <cstdint>
#include <random>
#include "AlignedMem.h"

using namespace std;

// Ch14_02_Misc.cpp
extern bool Avx512ConvertImgU8ToF32Cpp(float* des, const uint8_t* src, uint32_t num_pixels);
extern bool Avx512ConvertImgF32ToU8Cpp(uint8_t* des, const float* src, uint32_t num_pixels);
extern uint32_t Avx512ConvertImgVerify(const float* src1, const float* src2, uint32_t num_pixels);
extern uint32_t Avx512ConvertImgVerify(const uint8_t* src1, const uint8_t* src2, uint32_t num_pixels);

// Ch14_02_.asm
extern "C" bool Avx512ConvertImgU8ToF32_(float* des, const uint8_t* src, uint32_t num_pixels);
extern "C" bool Avx512ConvertImgF32ToU8_(uint8_t* des, const float* src, uint32_t num_pixels);

void InitU8(uint8_t* x, uint32_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {0, 255};
    default_random_engine rng {seed};

    for (uint32_t i = 0; i < n; i++)
        x[i] = ui_dist(rng);
}

void InitF32(float* x, uint32_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {0, 1000};
    default_random_engine rng {seed};

    for (uint32_t i = 0; i < n; i++)
        x[i] = (float)ui_dist(rng) / 1000.0f;
}

void Avx512ConvertImgU8ToF32(void)
{
    const size_t align = 64;
    const uint32_t num_pixels = 1024;
    AlignedArray<uint8_t> src_aa(num_pixels, align);
    AlignedArray<float> des1_aa(num_pixels, align);
    AlignedArray<float> des2_aa(num_pixels, align);
    uint8_t* src = src_aa.Data();
    float* des1 = des1_aa.Data();
    float* des2 = des2_aa.Data();

    InitU8(src, num_pixels, 12);

    bool rc1 = Avx512ConvertImgU8ToF32Cpp(des1, src, num_pixels);
    bool rc2 = Avx512ConvertImgU8ToF32_(des2, src, num_pixels);

    cout << "\nResults for Avx512ConvertImgU8ToF32\n";

    if (!rc1 || !rc2)
    {
        cout << "Invalid return code - ";
        cout << "rc1 = " << boolalpha << rc1 << ", ";
        cout << "rc2 = " << boolalpha << rc2 << '\n';
        return;
    }

    uint32_t num_diff = Avx512ConvertImgVerify(des1, des2, num_pixels);
    cout << "  Number of pixel compare errors (num_diff) = " << num_diff << '\n';
}

void Avx512ConvertImgF32ToU8(void)
{
    const size_t align = 64;
    const uint32_t num_pixels = 1024;
    AlignedArray<float> src_aa(num_pixels, align);
    AlignedArray<uint8_t> des1_aa(num_pixels, align);
    AlignedArray<uint8_t> des2_aa(num_pixels, align);
    float* src = src_aa.Data();
    uint8_t* des1 = des1_aa.Data();
    uint8_t* des2 = des2_aa.Data();

    InitF32(src, num_pixels, 20);

    // Test values to demonstrate clipping in conversion functions
    src[0] = 0.5f;      src[8] = 3.33f;
    src[1] = -1.0f;     src[9] = 0.67f;
    src[2] = 0.38f;     src[10] = 0.75f;
    src[3] = 0.62f;     src[11] = 0.95f;
    src[4] = 2.1f;      src[12] = -0.33f;
    src[5] = 0.25f;     src[13] = 0.8f;
    src[6] = -1.25f;    src[14] = 0.12f;
    src[7] = 0.45f;     src[15] = 4.0f;

    bool rc1 = Avx512ConvertImgF32ToU8Cpp(des1, src, num_pixels);
    bool rc2 = Avx512ConvertImgF32ToU8_(des2, src, num_pixels);

    cout << "\nResults for Avx512ConvertImgF32ToU8\n";

    if (!rc1 || !rc2)
    {
        cout << "Invalid return code - ";
        cout << "rc1 = " << boolalpha << rc1 << ", ";
        cout << "rc2 = " << boolalpha << rc2 << '\n';
        return;
    }

    uint32_t num_diff = Avx512ConvertImgVerify(des1, des2, num_pixels);
    cout << "  Number of pixel compare errors (num_diff) = " << num_diff << '\n';
}

int main()
{
    Avx512ConvertImgU8ToF32();
    Avx512ConvertImgF32ToU8();
    return 0;
}

;-------------------------------------------------
;               Ch14_02.asm
;-------------------------------------------------

        include <cmpequ.asmh>
        extern c_NumPixelsMax:dword

            .const
r4_1p0      real4 1.0
r4_255p0    real4 255.0

; extern "C" bool Avx512ConvertImgU8ToF32_(float* des, const uint8_t* src, uint32_t num_pixels)

        .code
Avx512ConvertImgU8ToF32_ proc

; Make sure num_pixels is valid and pixel buffers are properly aligned
        xor eax,eax                             ;set error return code
        or r8d,r8d
        jz Done                                 ;jump if num_pixels is zero
        cmp r8d,[c_NumPixelsMax]
        ja Done                                 ;jump if num_pixels too big
        test r8d,3fh
        jnz Done                                ;jump if num_pixels % 64 != 0
        test rcx,3fh
        jnz Done                                ;jump if des not aligned
        test rdx,3fh
        jnz Done                                ;jump if src not aligned

; Perform required initializations
        shr r8d,6                               ;number of blocks (64 pixels/block)
        vmovss xmm0,real4 ptr [r4_1p0]
        vdivss xmm1,xmm0,real4 ptr [r4_255p0]
        vbroadcastss zmm5,xmm1                  ;packed scale factor (1.0 / 255.0)

        align 16
@@:     vpmovzxbd zmm0,xmmword ptr [rdx]
        vpmovzxbd zmm1,xmmword ptr [rdx+16]
        vpmovzxbd zmm2,xmmword ptr [rdx+32]
        vpmovzxbd zmm3,xmmword ptr [rdx+48]     ;zmm3:zmm0 = 64 U32 pixels

; Convert pixels from uint8_t to float [0.0, 255.0]
        vcvtudq2ps zmm16,zmm0
        vcvtudq2ps zmm17,zmm1
        vcvtudq2ps zmm18,zmm2
        vcvtudq2ps zmm19,zmm3                   ;zmm19:zmm16 = 64 F32 pixels

; Normalize pixels to [0.0, 1.0]
        vmulps zmm20,zmm16,zmm5
        vmulps zmm21,zmm17,zmm5
        vmulps zmm22,zmm18,zmm5
        vmulps zmm23,zmm19,zmm5                 ;zmm23:zmm20 = 64 F32 pixels (normalized)

; Save F32 pixels to des
        vmovaps zmmword ptr [rcx],zmm20
        vmovaps zmmword ptr [rcx+64],zmm21
        vmovaps zmmword ptr [rcx+128],zmm22
        vmovaps zmmword ptr [rcx+192],zmm23

; Update pointers and counters
        add rdx,64
        add rcx,256
        sub r8d,1
        jnz @B

        mov eax,1                               ;set success return code

Done:   vzeroupper
        ret
Avx512ConvertImgU8ToF32_ endp

; extern "C" bool Avx512ConvertImgF32ToU8_(uint8_t* des, const float* src, uint32_t num_pixels)

Avx512ConvertImgF32ToU8_ proc

; Make sure num_pixels is valid and pixel buffers are properly aligned
        xor eax,eax                             ;set error return code
        or r8d,r8d
        jz Done                                 ;jump if num_pixels is zero
        cmp r8d,[c_NumPixelsMax]
        ja Done                                 ;jump if num_pixels too big
        test r8d,3fh
        jnz Done                                ;jump if num_pixels % 64 != 0
        test rcx,3fh
        jnz Done                                ;jump if des not aligned
        test rdx,3fh
        jnz Done                                ;jump if src not aligned

; Perform required initializations
        shr r8d,4                               ;number of pixel blocks (16 pixels / block)
        vxorps zmm29,zmm29,zmm29                ;packed 0.0
        vbroadcastss zmm30,[r4_1p0]             ;packed 1.0
        vbroadcastss zmm31,[r4_255p0]           ;packed 255.0

        align 16
@@:     vmovaps zmm0,zmmword ptr [rdx]          ;zmm0 = block of 16 pixels

; Clip pixels in current block to [0.0, 1.0]
        vcmpps k1,zmm0,zmm29,CMP_GE             ;k1 = mask of pixels >= 0.0
        vmovaps zmm1{k1}{z},zmm0                ;all pixels >= 0.0
        vcmpps k2,zmm1,zmm30,CMP_GT             ;k2 = mask of pixels > 1.0
        vmovaps zmm1{k2},zmm30                  ;all pixels clipped to [0.0, 1.0]

; Convert pixels to uint8_t and save to des
        vmulps zmm2,zmm1,zmm31                  ;all pixels [0.0, 255.0]
        vcvtps2udq zmm3,zmm2{ru-sae}            ;all pixels [0, 255]
        vpmovusdb xmmword ptr [rcx],zmm3        ;save pixels as unsigned bytes

; Update pointers and counters
        add rdx,64
        add rcx,16
        sub r8d,1
        jnz @B

        mov eax,1                               ;set success return code

Done:   vzeroupper
        ret
Avx512ConvertImgF32ToU8_ endp
        end


The C++ code in Listing 14-2 begins with the requisite function declarations. The first declaration set 
is for the functions Avx512ConvertImgU8ToF32Cpp and Avx512ConvertImgF32ToU8Cpp, which are defined in 
the file Ch14_02_Misc.cpp. The source code for these functions is not shown since they're almost identical to 
the AVX2 counterpart functions that were used in source code example Ch07_06. Two minor changes were 
made: the source and destination pixel buffers are aligned on a 64-byte instead of a 16-byte boundary; the 
number of pixels in these buffers must be evenly divisible by 64 instead of 32. 

The function Avx512ConvertImgU8ToF32 initializes the test arrays for converting pixel values from 
uint8_t to float. This function uses the C++ template class AlignedArray<> to allocate these arrays on 
a 64-byte boundary. Following test array initialization, Avx512ConvertImgU8ToF32 invokes the C++ and 
assembly language conversion functions. It then calls Avx512ConvertImgVerify to verify the results. The 
function Avx512ConvertImgF32ToU8 converts pixel values from float to uint8_t. Note that this function 
intentionally initializes the first few values of the source pixel buffer src to known values in order to verify 
that the conversion functions properly clip out-of-range pixel values. 

The assembly language function Avx512ConvertImgU8ToF32_ begins its execution by validating 
num_pixels. It then confirms that the pixel buffers src and des are properly aligned on a 64-byte boundary. 
In source code example Ch07_06 from Chapter 7, pixel normalization was performed by dividing each 
pixel value by 255.0. Avx512ConvertImgU8ToF32_ carries out pixel normalization using the multiplicative 
scale factor 1.0/255.0 since floating-point multiplication is usually faster than floating-point division. The 
vbroadcastss zmm5,xmm1 instruction loads a packed version of this scale factor into register ZMM5. 

Each processing loop iteration starts with a vpmovzxbd zmm0,xmmword ptr [rdx] instruction. This 
instruction copies and zero-extends the 16 byte (or uint8_t) pixels pointed to by RDX to doublewords; it 
then saves these values in register ZMM0. Three more vpmovzxbd instructions are then employed to load 
another 48 pixels into registers ZMM1, ZMM2, and ZMM3. This is followed by four vcvtudq2ps instructions 
that convert each unsigned doubleword pixel value in registers ZMM0-ZMM3 to single-precision 
floating-point. The ensuing vmulps instructions multiply these values by the normalization scale factor; the results 
are then saved to the destination pixel buffer des using a series of vmovaps instructions. 
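
The per-pixel work done by this loop can be summarized in scalar C++. The sketch below is not the Ch14_02_Misc.cpp implementation (which is not shown in this chapter); ConvertU8ToF32Sketch is a hypothetical name, and each statement is annotated with the AVX-512 instruction it stands in for.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of the U8-to-F32 conversion: widen, convert, then scale.
inline void ConvertU8ToF32Sketch(float* des, const uint8_t* src, std::size_t num_pixels)
{
    // Computed once before the loop, like vdivss followed by vbroadcastss.
    const float scale = 1.0f / 255.0f;

    for (std::size_t i = 0; i < num_pixels; i++)
    {
        uint32_t pixel_u32 = src[i];                        // zero-extend (vpmovzxbd)
        float pixel_f32 = static_cast<float>(pixel_u32);    // convert (vcvtudq2ps)
        des[i] = pixel_f32 * scale;                         // normalize (vmulps)
    }
}
```

Multiplying by a precomputed reciprocal can differ from a true division by 255.0 in the least significant bit of the result, which is one reason a verification function compares the C++ and assembly language outputs.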

In source code example Ch07_06, all floating-point pixel values were clipped to [0.0, 1.0] before being 
converted to uint8_t values. The function Avx512ConvertImgF32ToU8_ also performs this same operation. 
Following its argument validation checks, Avx512ConvertImgF32ToU8_ loads registers ZMM29, ZMM30, and 
ZMM31 with packed versions of the single-precision floating-point constants 0.0, 1.0, and 255.0, respectively. 
The processing loop of Avx512ConvertImgF32ToU8_ begins each iteration with a vmovaps zmm0,zmmword ptr 
[rdx] instruction that loads a block of 16 single-precision floating-point pixels into register ZMM0. The ensuing 
vcmpps k1,zmm0,zmm29,CMP_GE instruction compares each pixel element in ZMM0 to 0.0 and saves the resultant 
compare mask in opmask register K1. The next instruction, vmovaps zmm1{k1}{z},zmm0, uses zero masking 
to eliminate all pixel values less than 0.0. Figure 14-1 illustrates these operations. 

vmovaps zmm0,zmmword ptr [rdx]      ;zmm0 = block of 16 pixels
vcmpps k1,zmm0,zmm29,CMP_GE         ;k1 = mask of pixels >= 0.0
vmovaps zmm1{k1}{z},zmm0            ;all pixels >= 0.0
vcmpps k2,zmm1,zmm30,CMP_GT         ;k2 = mask of pixels > 1.0
vmovaps zmm1{k2},zmm30              ;all pixels clipped to [0.0, 1.0]
vmulps zmm2,zmm1,zmm31              ;all pixels [0.0, 255.0]
vcvtps2udq zmm3,zmm2{ru-sae}        ;all pixels [0, 255]


Figure 14-1. Instruction sequence used to convert packed pixel values from floating-point to unsigned 
doubleword integers 


The subsequent vcmpps k2,zmm1,zmm30,CMP_GT instruction creates a mask of pixel values greater 
than 1.0 and saves this mask in opmask register K2. Following execution of the vmovaps zmm1{k2},zmm30 
instruction, all of the pixel values in register ZMM1 are greater than or equal to 0.0 and less than or equal 
to 1.0. The next two instructions, vmulps zmm2,zmm1,zmm31 and vcvtps2udq zmm3,zmm2{ru-sae}, convert 
the normalized floating-point pixel values to unsigned doubleword integers. Note that the vcvtps2udq 
instruction employs an instruction-level rounding control operand (round up) primarily for demonstration 
purposes. The ensuing vpmovusdb xmmword ptr [rcx],zmm3 instruction size-reduces the doubleword 
values to bytes using unsigned saturation and saves them in the destination buffer pointed to by RCX. Here 
are the results for source code example Ch14_02: 


Results for Avx512ConvertImgU8ToF32
  Number of pixel compare errors (num_diff) = 0

Results for Avx512ConvertImgF32ToU8
  Number of pixel compare errors (num_diff) = 0
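
The clip-and-convert sequence used by Avx512ConvertImgF32ToU8_ can also be expressed for a single pixel in scalar C++. This is a sketch under stated assumptions, not the book's C++ counterpart function: ConvertF32ToU8Sketch is a hypothetical name, and std::ceil is used to model the round-up control of the vcvtps2udq instruction.

```cpp
#include <cmath>
#include <cstdint>

// Scalar model of one pixel passing through the clip and convert steps.
inline uint8_t ConvertF32ToU8Sketch(float pixel)
{
    // Zero masking (vcmpps CMP_GE + vmovaps {k1}{z}):
    // keep the pixel if it is >= 0.0, otherwise replace it with 0.0.
    float clipped = (pixel >= 0.0f) ? pixel : 0.0f;

    // Merge masking (vcmpps CMP_GT + vmovaps {k2}):
    // overwrite only the pixels that exceed 1.0.
    if (clipped > 1.0f)
        clipped = 1.0f;

    // Scale to [0.0, 255.0] (vmulps), then round up to an
    // unsigned integer (vcvtps2udq with {ru-sae}).
    float scaled = clipped * 255.0f;
    uint32_t rounded = static_cast<uint32_t>(std::ceil(scaled));

    // Size-reduce to a byte using unsigned saturation (vpmovusdb).
    return static_cast<uint8_t>(rounded > 255u ? 255u : rounded);
}
```

Applied to a few of the test values from Listing 14-2, the sketch maps -1.0 to 0, 2.1 to 255, and 0.5 to 128 (127.5 rounded up). Because the pixel is already clipped to [0.0, 1.0] before scaling, the final saturation step never changes the value; it is kept here only to mirror the instruction sequence.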


Image Thresholding 


In source code example Ch07_08, you learned about image thresholding and how to create a binary (or two 
color) mask image. Briefly, thresholding is an image-processing technique that sets a mask image pixel to 
0xff to signify that the intensity value of the corresponding pixel in a grayscale image is greater than a 
predetermined threshold intensity value; otherwise, the mask image pixel is set to 0x00. The next source code 
example, Ch14_03, expands the image-thresholding technique that was used in Ch07_08 to support multiple 
compare operators. Listing 14-3 shows the source code for example Ch14_03. 


Listing 14-3. Example Ch14_03 


#pragma once
#include <cstdint>

// Compare operators
enum CmpOp { EQ, NE, LT, LE, GT, GE };

// Ch14_03_Misc.cpp
extern void Init(uint8_t* x, size_t n, unsigned int seed);
extern void ShowResults(const uint8_t* des1, const uint8_t* des2, size_t num_pixels, CmpOp cmp_op,
    uint8_t cmp_val, size_t test_id);

// Ch14_03_.asm
extern "C" bool Avx512ComparePixels_(uint8_t* des, const uint8_t* src, size_t num_pixels,
    CmpOp cmp_op, uint8_t cmp_val);


#include "stdafx.h"
#include <iostream>
#include <cassert>
#include "Ch14_03.h"
#include "AlignedMem.h"

using namespace std;

extern "C" const size_t c_NumPixelsMax = 16777216;

bool Avx512ComparePixelsCpp(uint8_t* des, const uint8_t* src, size_t num_pixels, CmpOp cmp_op,
    uint8_t cmp_val)
{
    // Make sure num_pixels is valid
    if ((num_pixels == 0) || (num_pixels > c_NumPixelsMax))
        return false;
    if ((num_pixels & 0x3f) != 0)
        return false;

    // Make sure src and des are aligned on a 64-byte boundary
    if (!AlignedMem::IsAligned(src, 64))
        return false;
    if (!AlignedMem::IsAligned(des, 64))
        return false;

    bool rc = true;
    const uint8_t cmp_false = 0x00;
    const uint8_t cmp_true = 0xff;

    switch (cmp_op)
    {
        case CmpOp::EQ:
            for (size_t i = 0; i < num_pixels; i++)
                des[i] = (src[i] == cmp_val) ? cmp_true : cmp_false;
            break;

        case CmpOp::NE:
            for (size_t i = 0; i < num_pixels; i++)
                des[i] = (src[i] != cmp_val) ? cmp_true : cmp_false;
            break;

        case CmpOp::LT:
            for (size_t i = 0; i < num_pixels; i++)
                des[i] = (src[i] < cmp_val) ? cmp_true : cmp_false;
            break;

        case CmpOp::LE:
            for (size_t i = 0; i < num_pixels; i++)
                des[i] = (src[i] <= cmp_val) ? cmp_true : cmp_false;
            break;

        case CmpOp::GT:
            for (size_t i = 0; i < num_pixels; i++)
                des[i] = (src[i] > cmp_val) ? cmp_true : cmp_false;
            break;

        case CmpOp::GE:
            for (size_t i = 0; i < num_pixels; i++)
                des[i] = (src[i] >= cmp_val) ? cmp_true : cmp_false;
            break;

        default:
            cout << "Invalid CmpOp: " << cmp_op << '\n';
            rc = false;
    }

    return rc;
}

int main()
{
    const size_t align = 64;
    const size_t num_pixels = 4 * 1024 * 1024;
    AlignedArray<uint8_t> src_aa(num_pixels, align);
    AlignedArray<uint8_t> des1_aa(num_pixels, align);
    AlignedArray<uint8_t> des2_aa(num_pixels, align);
    uint8_t* src = src_aa.Data();
    uint8_t* des1 = des1_aa.Data();
    uint8_t* des2 = des2_aa.Data();

    const uint8_t cmp_vals[] {197, 222, 43, 43, 129, 222};
    const CmpOp cmp_ops[] {CmpOp::EQ, CmpOp::NE, CmpOp::LT, CmpOp::LE, CmpOp::GT, CmpOp::GE};
    const size_t num_cmp_vals = sizeof(cmp_vals) / sizeof(uint8_t);
    const size_t num_cmp_ops = sizeof(cmp_ops) / sizeof(CmpOp);

    assert(num_cmp_vals == num_cmp_ops);

    Init(src, num_pixels, 511);

    cout << "Results for Ch14_03\n";

    for (size_t i = 0; i < num_cmp_ops; i++)
    {
        Avx512ComparePixelsCpp(des1, src, num_pixels, cmp_ops[i], cmp_vals[i]);
        Avx512ComparePixels_(des2, src, num_pixels, cmp_ops[i], cmp_vals[i]);
        ShowResults(des1, des2, num_pixels, cmp_ops[i], cmp_vals[i], i + 1);
    }

    return 0;
}


        include <cmpequ.asmh>
        extern c_NumPixelsMax:qword

; Macro _CmpPixels

_CmpPixels macro CmpOp
        align 16
@@:     vmovdqa64 zmm0,zmmword ptr [rdx+rax]    ;load next block of 64 pixels
        vpcmpub k1,zmm0,zmm4,CmpOp              ;perform compare operation
        vmovdqu8 zmm1{k1}{z},zmm5               ;set mask pixels to 0 or 255 using opmask
        vmovdqa64 zmmword ptr [rcx+rax],zmm1    ;save mask pixels

        add rax,64                              ;update offset
        sub r8,64
        jnz @B                                  ;repeat until done

        mov eax,1                               ;set success return code
        vzeroupper
        ret
        endm

; extern "C" bool Avx512ComparePixels_(uint8_t* des, const uint8_t* src,
;   size_t num_pixels, CmpOp cmp_op, uint8_t cmp_val);

        .code
Avx512ComparePixels_ proc

; Make sure num_pixels is valid and pixel buffers are properly aligned
        xor eax,eax                             ;set error code (also array offset)
        or r8,r8
        jz Done                                 ;jump if num_pixels is zero
        cmp r8,[c_NumPixelsMax]
        ja Done                                 ;jump if num_pixels too big
        test r8,3fh
        jnz Done                                ;jump if num_pixels % 64 != 0

        test rcx,3fh
        jnz Done                                ;jump if des not aligned
        test rdx,3fh
        jnz Done                                ;jump if src not aligned

; Perform required initializations
        vpbroadcastb zmm4,byte ptr [rsp+40]     ;zmm4 = packed cmp_val
        mov r10d,255
        vpbroadcastb zmm5,r10d                  ;zmm5 = packed 255

; Perform specified compare operation
        cmp r9d,0
        jne LB_NE
        _CmpPixels CMP_EQ                       ;CmpOp::EQ

LB_NE:  cmp r9d,1
        jne LB_LT
        _CmpPixels CMP_NEQ                      ;CmpOp::NE

LB_LT:  cmp r9d,2
        jne LB_LE
        _CmpPixels CMP_LT                       ;CmpOp::LT

LB_LE:  cmp r9d,3
        jne LB_GT
        _CmpPixels CMP_LE                       ;CmpOp::LE

LB_GT:  cmp r9d,4
        jne LB_GE
        _CmpPixels CMP_NLE                      ;CmpOp::GT

LB_GE:  cmp r9d,5
        jne Done
        _CmpPixels CMP_NLT                      ;CmpOp::GE

Done:   vzeroupper
        ret
Avx512ComparePixels_ endp
        end


Near the top of the header file Ch14_03.h is an enum named CmpOp, which contains identifiers for the 
common compare operations. This is followed by the example's function declarations. The C++ functions 
Init and ShowResults are ancillary functions that perform test array initialization and display results. 
The source code for these functions is not shown in Listing 14-3 but is included with the chapter download 
package. The function Avx512ComparePixels_ is an AVX-512 assembly language function that implements 
the pixel thresholding algorithm. 

The function Avx512ComparePixelsCpp contains the C++ implementation of the updated thresholding 
algorithm. This function begins its execution by validating num_pixels for size and divisibility by 64. It 
then verifies that the pixel buffers src and des are properly aligned on a 64-byte boundary. Following the 
argument validation code is a switch statement that applies the selector cmp_op to select a compare 
operation. Each switch statement case code block is a simple for loop that compares src[i] against 
cmp_val using the specified operator and sets pixels in the mask image to 0xff (true compare) or 0x00 (false 
compare). The function main includes code that allocates the image pixel buffers, exercises the functions 
Avx512ComparePixelsCpp and Avx512ComparePixels_ using various compare operators, and displays results. 

The assembly language code in Listing 14-3 commences with the macro _CmpPixels. This macro 
generates AVX-512 code that implements a processing loop for a pixel compare operator. The macro 
_CmpPixels requires the following register initializations prior to its use: RAX = 0, RCX = mask image pixel 
buffer, RDX = grayscale image pixel buffer, R8 = number of pixels, ZMM4 = packed byte threshold values, and 
ZMM5 = packed 0xff byte values. Each processing loop iteration of _CmpPixels begins with a vmovdqa64 
zmm0,zmmword ptr [rdx+rax] instruction that loads 64 unsigned 8-bit integers into register ZMM0. The 
next instruction, vpcmpub k1,zmm0,zmm4,CmpOp, compares the grayscale pixel intensity values in ZMM0 to 
the packed values in ZMM4; it then saves the resultant mask in opmask register K1. The ensuing vmovdqu8 
zmm1{k1}{z},zmm5 instruction sets each mask pixel value in ZMM1 to 0xff (true compare) or 0x00 (false 
compare) according to the value of the corresponding bit position in K1. The instruction vmovdqa64 zmmword 
ptr [rcx+rax],zmm1 then saves the 64 mask pixels to the mask image pixel buffer. 
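
The two-step pattern used by _CmpPixels (build a 64-bit compare mask, then expand it into bytes with a zero-masked move) can be modeled in C++ as follows. This sketch handles only the greater-than operator, and the function names CompareGtBlock, ExpandMaskBlock, and ThresholdGtSketch are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Models vpcmpub k1,zmm0,zmm4,CMP_NLE for one block of 64 pixels:
// bit i of the result is set when src[i] > cmp_val.
inline uint64_t CompareGtBlock(const uint8_t* src, uint8_t cmp_val)
{
    uint64_t k = 0;

    for (int i = 0; i < 64; i++)
    {
        if (src[i] > cmp_val)
            k |= (uint64_t{1} << i);
    }

    return k;
}

// Models vmovdqu8 zmm1{k1}{z},zmm5 where zmm5 holds packed 0xff:
// des[i] is 0xff if bit i of k is set, otherwise 0x00.
inline void ExpandMaskBlock(uint8_t* des, uint64_t k)
{
    for (int i = 0; i < 64; i++)
        des[i] = ((k >> i) & 1) ? 0xff : 0x00;
}

// Processes an image in blocks of 64 pixels, like the macro's loop.
inline void ThresholdGtSketch(uint8_t* des, const uint8_t* src, std::size_t num_pixels,
    uint8_t cmp_val)
{
    assert(num_pixels % 64 == 0);

    for (std::size_t offset = 0; offset < num_pixels; offset += 64)
        ExpandMaskBlock(des + offset, CompareGtBlock(src + offset, cmp_val));
}
```

Keeping the compare result in an opmask register means one bit per pixel instead of one byte, and the zero-masked move converts that bit pattern into the final 0xff or 0x00 mask pixels without any additional logical instructions.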

The function Avx512ComparePixels_ employs the macro _CmpPixels to implement the same algorithm 
as its C++ counterpart Avx512ComparePixelsCpp. Following the requisite argument validation checks, a 
vpbroadcastb zmm4,byte ptr [rsp+40] instruction broadcasts cmp_val to each byte element in register 
ZMM4. The next two instructions, mov r10d,255 and vpbroadcastb zmm5,r10d, load the value 0xff into 
each byte element of ZMM5. The remaining code in Avx512ComparePixels_ uses the argument value 
cmp_op to implement an ad hoc switch statement that takes advantage of the macro _CmpPixels. Note 
that this function uses the compare equates CMP_NLE (not less than or equal) and CMP_NLT (not less than) for 
the _CmpPixels macro argument CmpOp instead of CMP_GT and CMP_GE. The reason for this is that the vpcmpub 
instruction in _CmpPixels does not support use of the CMP_GT and CMP_GE equates (mathematically these 
latter equates are equivalent to CMP_NLE and CMP_NLT, but are assigned different values in cmpequ.asmh). 
Here are the results for source code example Ch14_03: 


Results for Ch14_03

Test #1
  num_pixels: 4194304
  cmp_op: EQ
  cmp_val: 197
  Pixel masks are identical
  Number of non-zero mask pixels = 16424

Test #2
  num_pixels: 4194304
  cmp_op: NE
  cmp_val: 222
  Pixel masks are identical
  Number of non-zero mask pixels = 4177927

Test #3
  num_pixels: 4194304
  cmp_op: LT
  cmp_val: 43
  Pixel masks are identical
  Number of non-zero mask pixels = 703652

Test #4
  num_pixels: 4194304
  cmp_op: LE
  cmp_val: 43
  Pixel masks are identical
  Number of non-zero mask pixels = 719787

Test #5
  num_pixels: 4194304
  cmp_op: GT
  cmp_val: 129
  Pixel masks are identical
  Number of non-zero mask pixels = 2065724

Test #6
  num_pixels: 4194304
  cmp_op: GE
  cmp_val: 222
  Pixel masks are identical
  Number of non-zero mask pixels = 556908
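
The use of CMP_NLE and CMP_NLT for the GT and GE tests follows from the way the packed integer compare instructions encode their predicate. The sketch below assumes the 3-bit immediate encoding that Intel documents for the vpcmp family of instructions (0 = EQ, 1 = LT, 2 = LE, 3 = FALSE, 4 = NEQ, 5 = NLT, 6 = NLE, 7 = TRUE); whether cmpequ.asmh assigns exactly these values to its equates is an assumption, and all names in the code are hypothetical.

```cpp
#include <cstdint>

// Mirrors the CmpOp enum from Ch14_03.h under a different name.
enum CmpOpSketch { EQ, NE, LT, LE, GT, GE };

// Predicate values of the vpcmp immediate operand (assumed encoding).
enum VpcmpPredicate : uint8_t
{
    PRED_EQ = 0, PRED_LT = 1, PRED_LE = 2, PRED_FALSE = 3,
    PRED_NEQ = 4, PRED_NLT = 5, PRED_NLE = 6, PRED_TRUE = 7
};

// Maps a C++ compare operator to a predicate. There is no GT or GE
// predicate, so GT maps to "not less than or equal" and GE maps to
// "not less than".
inline VpcmpPredicate ToPredicate(CmpOpSketch op)
{
    switch (op)
    {
        case EQ: return PRED_EQ;
        case NE: return PRED_NEQ;
        case LT: return PRED_LT;
        case LE: return PRED_LE;
        case GT: return PRED_NLE;
        case GE: return PRED_NLT;
    }
    return PRED_FALSE;
}

// Evaluates a predicate for one pair of unsigned byte values.
inline bool EvalPredicate(VpcmpPredicate pred, uint8_t a, uint8_t b)
{
    switch (pred)
    {
        case PRED_EQ:    return a == b;
        case PRED_LT:    return a < b;
        case PRED_LE:    return a <= b;
        case PRED_FALSE: return false;
        case PRED_NEQ:   return a != b;
        case PRED_NLT:   return !(a < b);
        case PRED_NLE:   return !(a <= b);
        case PRED_TRUE:  return true;
    }
    return false;
}
```

For integer operands, "not less than or equal" and "greater than" always produce the same answer, so the substitution changes nothing in the resulting mask image.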


Image Statistics 


Listing 14-4 shows the source code for example Ch14_04. This example illustrates how to calculate the mean 
and standard deviation of a grayscale image using its pixel intensity values. In order to make source code 
example Ch14_04 a little more interesting, the C++ and assembly language functions use only the pixel 
values that reside between two threshold limits. Pixel values outside of these limits are excluded from any 
mean and standard deviation calculations. 
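
A small worked example helps make the range-limited calculation concrete. The sketch below accumulates a count, a sum, and a sum of squares over the in-range pixels and then derives the mean and the sample standard deviation from those three values. The names RangeStatsSketch and CalcRangeStats are hypothetical, and the handling of fewer than two in-range pixels is an assumption made for this illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

struct RangeStatsSketch
{
    uint64_t n;         // number of pixels inside [val_min, val_max]
    uint64_t sum;       // sum of the in-range pixel values
    uint64_t sum_sq;    // sum of squares of the in-range pixel values
    double mean;
    double sd;
};

inline RangeStatsSketch CalcRangeStats(const uint8_t* pb, std::size_t num_pixels,
    uint32_t val_min, uint32_t val_max)
{
    RangeStatsSketch rs {0, 0, 0, 0.0, 0.0};

    for (std::size_t i = 0; i < num_pixels; i++)
    {
        uint32_t pval = pb[i];

        if (pval >= val_min && pval <= val_max)
        {
            rs.sum += pval;
            rs.sum_sq += static_cast<uint64_t>(pval) * pval;
            rs.n++;
        }
    }

    if (rs.n == 0)
        return rs;                      // no in-range pixels: mean and sd stay 0.0

    rs.mean = static_cast<double>(rs.sum) / rs.n;

    if (rs.n >= 2)
    {
        // Sample variance = (n * sum_sq - sum^2) / (n * (n - 1))
        double var_num = static_cast<double>(rs.n) * rs.sum_sq
                       - static_cast<double>(rs.sum) * rs.sum;
        double var_den = static_cast<double>(rs.n) * (rs.n - 1);
        rs.sd = std::sqrt(var_num / var_den);
    }

    return rs;
}
```

For the pixel values 10, 50, 100, 150, and 240 with limits of 40 and 230, only 50, 100, and 150 are counted. This gives a sum of 300 and a sum of squares of 35000, so the mean is 100 and the variance is (3 * 35000 - 300 * 300) / (3 * 2) = 2500, which yields a standard deviation of 50.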


Listing 14-4. Example Ch14_04 


#pragma once
#include <cstdint>

// This structure must match the structure that's defined in Ch14_04_.asm.
struct ImageStats
{
    uint8_t* m_PixelBuffer;
    uint64_t m_NumPixels;
    uint32_t m_PixelValMin;
    uint32_t m_PixelValMax;
    uint64_t m_NumPixelsInRange;
    uint64_t m_PixelSum;
    uint64_t m_PixelSumOfSquares;
    double m_PixelMean;
    double m_PixelSd;
};

// Ch14_04.cpp
extern bool Avx512CalcImageStatsCpp(ImageStats& im_stats);

// Ch14_04_.asm
extern "C" bool Avx512CalcImageStats_(ImageStats& im_stats);

// Ch14_04_BM.cpp
extern void Avx512CalcImageStats_BM(void);

// Common constants
const uint32_t c_PixelValMin = 40;
const uint32_t c_PixelValMax = 230;


#include "stdafx.h"
#include <cstdint>
#include <cmath>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <string>
#include <stdexcept>
#include "Ch14_04.h"
#include "AlignedMem.h"
#include "ImageMatrix.h"

using namespace std;

extern "C" uint64_t c_NumPixelsMax = 256 * 1024;

bool Avx512CalcImageStatsCpp(ImageStats& im_stats)
{
    uint64_t num_pixels = im_stats.m_NumPixels;
    const uint8_t* pb = im_stats.m_PixelBuffer;

    // Perform validation checks
    if ((num_pixels == 0) || (num_pixels > c_NumPixelsMax))
        return false;
    if (!AlignedMem::IsAligned(pb, 64))
        return false;

    // Calculate intermediate sums
    im_stats.m_PixelSum = 0;
    im_stats.m_PixelSumOfSquares = 0;
    im_stats.m_NumPixelsInRange = 0;

    for (size_t i = 0; i < num_pixels; i++)
    {
        uint32_t pval = pb[i];

        if (pval >= im_stats.m_PixelValMin && pval <= im_stats.m_PixelValMax)
        {
            im_stats.m_PixelSum += pval;
            im_stats.m_PixelSumOfSquares += pval * pval;
            im_stats.m_NumPixelsInRange++;
        }
    }

    // Calculate mean and standard deviation
    double temp0 = (double)im_stats.m_NumPixelsInRange * im_stats.m_PixelSumOfSquares;
    double temp1 = (double)im_stats.m_PixelSum * im_stats.m_PixelSum;
    double var_num = temp0 - temp1;
    double var_den = (double)im_stats.m_NumPixelsInRange * (im_stats.m_NumPixelsInRange - 1);
    double var = var_num / var_den;

    im_stats.m_PixelMean = (double)im_stats.m_PixelSum / im_stats.m_NumPixelsInRange;
    im_stats.m_PixelSd = sqrt(var);

    return true;
}

void Avx512CalcImageStats()
{
    const wchar_t* image_fn = L"..\\Ch14_Data\\TestImage4.bmp";

    ImageStats is1, is2;
    ImageMatrix im(image_fn);
    uint64_t num_pixels = im.GetNumPixels();
    uint8_t* pb = im.GetPixelBuffer<uint8_t>();

    is1.m_PixelBuffer = pb;
    is1.m_NumPixels = num_pixels;
    is1.m_PixelValMin = c_PixelValMin;
    is1.m_PixelValMax = c_PixelValMax;

    is2.m_PixelBuffer = pb;
    is2.m_NumPixels = num_pixels;
    is2.m_PixelValMin = c_PixelValMin;
    is2.m_PixelValMax = c_PixelValMax;

const char nl 
const char* s - 
const unsigned int w1 - 22; 
const unsigned int w2 - 12; 


M 
A 
35 

` 


l 
` 


cout << fixed << setprecision(6) << left; 
wcout << fixed << setprecision(6) << left; 


cout << "\nResults for Avx512CalcImageStats\n"; 

wcout << setw(w1) << "image_fn:" << setw(w2) << image_fn << nl; 

cout << setw(w1) << "num pixels:" << setw(w2) << num pixels << nl; 

cout << setw(w1) << "c PixelValMin:" << setw(w2) << c PixelValMin << nl; 
cout << setw(w1) << "c PixelValMax:" << setw(w2) << c PixelValMax << nl; 


bool rci = Avx512CalcImageStatsCpp(is1); 
bool rc2 = Avx512CalcImageStats (is2); 


if (Iret || !rc2) 
( 
cout << "Bad return code\n"; 


cout << " rc1 =" << rel << ‘\n'; 
<< re2 << ‘\n'; 


cout << rc2 = 
return; 


cout << nl; 


cout << setw(w1) << "m NumPixelsInRange: "; 
cout << setw(w2) << is1.m NumPixelsInRange << s; 
cout << setw(w2) << is2.m NumPixelsInRange << nl; 


cout << setw(w1) << "m PixelSum:"; 
cout << setw(w2) << is1.m PixelSum << s; 
cout << setw(w2) << is2.m PixelSum << nl; 
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cout «« setw(w1) «« "m PixelSumOfSquares:"; 
cout << setw(w2) << isi.m PixelSumOfSquares << s; 
cout «« setw(w2) «« is2.m PixelSumOfSquares «« nl; 
cout «« setw(w1) «« "m PixelMean:"; 
cout << setw(w2) << is1.m PixelMean << s; 
cout «« setw(w2) «« is2.m PixelMean «« nl; 
cout << setw(w1) << "m PixelSd:"; 
cout << setw(w2) << is1.m PixelSd << s; 
cout << setw(w2) << is2.m PixelSd << nl; 
} 
int main() 
1 
try 
{ 
Avx512CalcImageStats(); 
Avx512CalcImageStats_BM(); 
} 
catch (runtime error& rte) 
{ 
cout << "'runtime error' exception has occurred - " << rte.what() << '\n'; 
} 
catch (...) 
{ 
cout << "Unexpected exception has occurred\n"; 
cout << "File = "<< FILE << '\n'; 
} 
return 0; 
} 
;-------------------------------------------------
;               Ch14_04_.asm
;-------------------------------------------------

        include <cmpequ.asmh>
        include <MacrosX86-64-AVX.asmh>
        extern c_NumPixelsMax:qword

; This structure must match the structure that's defined in Ch14_04.h

ImageStats          struct
PixelBuffer         qword ?
NumPixels           qword ?
PixelValMin         dword ?
PixelValMax         dword ?
NumPixelsInRange    qword ?
PixelSum            qword ?
PixelSumOfSquares   qword ?
PixelMean           real8 ?
PixelSd             real8 ?
ImageStats          ends

_UpdateSums macro Disp
        vpmovzxbd zmm0,xmmword ptr [rcx+Disp]   ;zmm0 = 16 pixels
        vpcmpud k1,zmm0,zmm31,CMP_GE            ;k1 = mask of pixels >= pixel_val_min
        vpcmpud k2,zmm0,zmm30,CMP_LE            ;k2 = mask of pixels <= pixel_val_max
        kandw k3,k2,k1                          ;k3 = mask of in-range pixels
        vmovdqa32 zmm1{k3}{z},zmm0              ;zmm1 = in-range pixels
        vpaddd zmm16,zmm16,zmm1                 ;update packed pixel_sum
        vpmulld zmm2,zmm1,zmm1
        vpaddd zmm17,zmm17,zmm2                 ;update packed pixel_sum_of_squares
        kmovw rax,k3
        popcnt rax,rax                          ;count number of in-range pixels
        add r10,rax                             ;update num_pixels_in_range
        endm

; extern "C" bool Avx512CalcImageStats_(ImageStats& im_stats);

        .code
Avx512CalcImageStats_ proc frame
        _CreateFrame CIS_,0,0,rsi,r12,r13
        _EndProlog

; Make sure num_pixels is valid and pixel_buff is properly aligned
        xor eax,eax                         ;set error return code
        mov rsi,rcx                         ;rsi = im_stats ptr

        mov rcx,qword ptr [rsi+ImageStats.PixelBuffer]  ;rcx = pixel buffer ptr
        mov rdx,qword ptr [rsi+ImageStats.NumPixels]    ;rdx = num_pixels
        test rdx,rdx
        jz Done                             ;jump if num_pixels is zero
        cmp rdx,[c_NumPixelsMax]
        ja Done                             ;jump if num_pixels too big
        test rcx,3fh
        jnz Done                            ;jump if pixel_buff misaligned

; Perform required initializations
        mov r8d,dword ptr [rsi+ImageStats.PixelValMin]
        mov r9d,dword ptr [rsi+ImageStats.PixelValMax]
        vpbroadcastd zmm31,r8d              ;packed pixel_val_min
        vpbroadcastd zmm30,r9d              ;packed pixel_val_max
        vpxorq zmm29,zmm29,zmm29            ;packed pixel_sum
        vpxorq zmm28,zmm28,zmm28            ;packed pixel_sum_of_squares
        xor r10d,r10d                       ;num_pixels_in_range = 0

; Compute packed versions of pixel_sum and pixel_sum_of_squares
        cmp rdx,64
        jb LB1                              ;jump if there are fewer than 64 pixels

        align 16
@@:     vpxord zmm16,zmm16,zmm16            ;loop packed pixel_sum = 0
        vpxord zmm17,zmm17,zmm17            ;loop packed pixel_sum_of_squares = 0

        _UpdateSums 0                       ;process pixel_buff[i+15]:pixel_buff[i]
        _UpdateSums 16                      ;process pixel_buff[i+31]:pixel_buff[i+16]
        _UpdateSums 32                      ;process pixel_buff[i+47]:pixel_buff[i+32]
        _UpdateSums 48                      ;process pixel_buff[i+63]:pixel_buff[i+48]

        vextracti32x8 ymm0,zmm16,1          ;extract top 8 pixel_sum (dwords)
        vpaddd ymm1,ymm0,ymm16
        vpmovzxdq zmm2,ymm1
        vpaddq zmm29,zmm29,zmm2             ;update packed pixel_sum (qwords)

        vextracti32x8 ymm0,zmm17,1          ;extract top 8 pixel_sum_of_squares (dwords)
        vpaddd ymm1,ymm0,ymm17
        vpmovzxdq zmm2,ymm1
        vpaddq zmm28,zmm28,zmm2             ;update packed pixel_sum_of_squares (qwords)

        add rcx,64                          ;update pb ptr
        sub rdx,64                          ;update num_pixels
        cmp rdx,64
        jae @B                              ;repeat until done

        align 16
LB1:    test rdx,rdx
        jz LB3                              ;jump if no more pixels remain

        xor r13,r13                         ;pixel_sum = 0
        xor r12,r12                         ;pixel_sum_of_squares = 0
        mov r11,rdx                         ;number of remaining pixels

@@:     movzx rax,byte ptr [rcx]            ;load next pixel
        cmp rax,r8
        jb LB2                              ;jump if current pixel < pval_min
        cmp rax,r9
        ja LB2                              ;jump if current pixel > pval_max

        add r13,rax                         ;add to pixel_sum
        mul rax
        add r12,rax                         ;add to pixel_sum_of_squares
        add r10,1                           ;update num_pixels_in_range

LB2:    add rcx,1
        sub r11,1
        jnz @B                              ;repeat until done

; Save num_pixel_in_range
LB3:    mov qword ptr [rsi+ImageStats.NumPixelsInRange],r10

; Reduce packed pixel_sum to single qword
        vextracti64x4 ymm0,zmm29,1
        vpaddq ymm1,ymm0,ymm29
        vextracti64x2 xmm2,ymm1,1
        vpaddq xmm3,xmm2,xmm1
        vpextrq rax,xmm3,0
        vpextrq r11,xmm3,1
        add rax,r11                         ;rax = sum of qwords in zmm29
        add r13,rax                         ;add scalar pixel_sum
        mov qword ptr [rsi+ImageStats.PixelSum],r13

; Reduce packed pixel_sum_of_squares to single qword
        vextracti64x4 ymm0,zmm28,1
        vpaddq ymm1,ymm0,ymm28
        vextracti64x2 xmm2,ymm1,1
        vpaddq xmm3,xmm2,xmm1
        vpextrq rax,xmm3,0
        vpextrq r11,xmm3,1
        add rax,r11                         ;rax = sum of qwords in zmm28
        add r12,rax                         ;add scalar pixel_sum_of_squares
        mov qword ptr [rsi+ImageStats.PixelSumOfSquares],r12

; Calculate final mean and sd
        vcvtusi2sd xmm0,xmm0,r10            ;num_pixels_in_range (DPFP)
        sub r10,1
        vcvtusi2sd xmm1,xmm1,r10            ;num_pixels_in_range - 1 (DPFP)
        vcvtusi2sd xmm2,xmm2,r13            ;pixel_sum (DPFP)
        vcvtusi2sd xmm3,xmm3,r12            ;pixel_sum_of_squares (DPFP)

        vdivsd xmm4,xmm2,xmm0               ;final pixel_mean
        vmovsd real8 ptr [rsi+ImageStats.PixelMean],xmm4

        vmulsd xmm4,xmm0,xmm3               ;num_pixels_in_range * pixel_sum_of_squares
        vmulsd xmm5,xmm2,xmm2               ;pixel_sum * pixel_sum
        vsubsd xmm2,xmm4,xmm5               ;var_num
        vmulsd xmm3,xmm0,xmm1               ;var_den
        vdivsd xmm4,xmm2,xmm3               ;calc variance
        vsqrtsd xmm0,xmm0,xmm4              ;final pixel_sd
        vmovsd real8 ptr [rsi+ImageStats.PixelSd],xmm0

        mov eax,1                           ;set success return code

Done:   vzeroupper
        _DeleteFrame rsi,r12,r13
        ret
Avx512CalcImageStats_ endp
        end

The mean and standard deviation of the pixels in a grayscale image can be calculated using the 
following equations: 

\[ \bar{x} = \frac{1}{n} \sum_{i} x_i \]

\[ s = \sqrt{\frac{n \sum_{i} x_i^2 - \left( \sum_{i} x_i \right)^2}{n (n - 1)}} \]


In the mean and standard deviation equations, the symbol x represents an image buffer pixel and n 
denotes the number of pixels. If you study these equations carefully, you will notice that two intermediate 
sums must be calculated: the sum of all pixels and the sum of all pixel values squared. Once these quantities 
are known, the mean and standard deviation can be determined using simple arithmetic. The standard 
deviation equation that's detailed here is simple to calculate and suitable for this source code example. 
For other use cases, however, this same equation is often unsuitable for standard deviation calculations, 
especially those that involve floating-point values. You may want to consult the statistical variance 
calculation references that are listed in Appendix A before using this equation in one of your own programs. 
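
The following is a minimal C++ sketch, not part of the book's source code, that contrasts the sum-based equations with Welford's single-pass update, one commonly cited alternative for floating-point data. The names Stats, StatsFromSums, and StatsWelford are hypothetical, and both functions assume at least two input values.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Stats { double mean; double sd; };

// Sum-based form: needs only n, sum(x), and sum(x^2), all exact integers here.
Stats StatsFromSums(const std::vector<uint8_t>& x)
{
    assert(x.size() >= 2);                      // n - 1 must be nonzero
    uint64_t n = x.size(), sum = 0, sum_sq = 0;

    for (uint8_t v : x)
    {
        sum += v;
        sum_sq += uint64_t(v) * v;
    }

    double var_num = double(n) * double(sum_sq) - double(sum) * double(sum);
    double var_den = double(n) * double(n - 1);
    return { double(sum) / double(n), std::sqrt(var_num / var_den) };
}

// Welford's update: tracks a running mean and a running sum of squared
// deviations, which avoids subtracting two large, nearly equal numbers.
Stats StatsWelford(const std::vector<uint8_t>& x)
{
    assert(x.size() >= 2);
    double mean = 0.0, m2 = 0.0;
    size_t n = 0;

    for (uint8_t v : x)
    {
        n++;
        double delta = v - mean;
        mean += delta / double(n);
        m2 += delta * (v - mean);
    }

    return { mean, std::sqrt(m2 / double(n - 1)) };
}
```

For 8-bit pixels the integer sums are exact, so both functions agree closely; the difference matters mainly when the inputs are floating-point values with a large mean relative to their spread.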

Listing 14-4 begins with the C++ header file Ch14_04.h that includes the declaration of a structure 
named ImageStats. This structure is used to pass image data to the C++ and assembly language calculating 
functions and return results. A semantically equivalent structure is also defined in the assembly language 
file Ch14_04_.asm. The file Ch14_04.h also includes the constant definitions c_PixelValMin and 
c_PixelValMax, which define the range limits that a pixel value must fall between to be included in any 
statistical calculations. 

The function Avx512CalcImageStatsCpp is the principal calculating function in the C++ code. 
This function requires a reference to an ImageStats structure as its sole argument. Following argument 
validation, Avx512CalcImageStatsCpp initializes the ImageStats intermediate sums m_PixelSum, 
m_PixelSumOfSquares, and m_NumPixelsInRange to zero. A simple for loop follows, which calculates 
m_PixelSum and m_PixelSumOfSquares. During each loop iteration, pixel values are tested for in-range validity 
before being included in any calculations. Following computation of the intermediate sums, the function 
Avx512CalcImageStatsCpp calculates the final mean and standard deviation. Note that m_NumPixelsInRange 
is used to calculate these statistical quantities instead of m_NumPixels. The remaining code in Ch14_04.cpp 
performs test case initialization, invokes the calculating functions, and streams the results to cout. 

Toward the top of the file Ch14_04_.asm is the assembly language version of the structure ImageStats. 
This is followed by the macro definition _UpdateSums whose inner workings will be described shortly. 
The function Avx512CalcImageStats_ begins its execution by performing the same argument validation 
checks as its C++ counterpart. It then initializes packed versions of the intermediate values PixelValMin 
and PixelValMax. The ensuing vpxorq instructions initialize packed quadword versions of PixelSum and 
PixelSumOfSquares to zero. Note that the vpxor[d|q] (and other AVX-512 bitwise Boolean) instructions 
can optionally specify an opmask operand register to perform merge or zero masking of doubleword or 
quadword elements. The final initialization instruction, xor r10d,r10d, sets NumPixelsInRange to zero. 

The processing loop in function Avx512CalcImageStats_ processes 64 pixels each iteration. Prior 
to the start of the processing loop, register RDX is tested to verify that at least 64 pixels remain. Each 
processing loop iteration begins with two vpxord instructions that initialize packed doubleword versions of 
pixel_sum and pixel_sum_of_squares to zero. Following this are four instances of the macro _UpdateSums, 
which process in aggregate the next group of 64 pixels. The first instruction of this macro, vpmovzxbd 
zmm0,xmmword ptr [rcx+Disp], loads 16 unsigned byte values from the source pixel buffer and saves these 
values as unsigned doublewords in register ZMM0. The ensuing vpcmpud k1,zmm0,zmm31,CMP_GE, vpcmpud 
k2,zmm0,zmm30,CMP_LE, and kandw k3,k2,k1 instructions load opmask register K3 with a mask value 
of pixels that are greater than or equal to pixel_val_min and less than or equal to pixel_val_max. The 
vmovdqa32 zmm1{k3}{z},zmm0 that follows uses zero masking to effectively eliminate out-of-range pixel 
values from further calculations. The subsequent vpaddd and vpmulld instructions then update the packed 
doubleword quantities pixel_sum and pixel_sum_of_squares. The total number of in-range pixels in 
R10 is then updated using the instructions kmovw rax,k3, popcnt rax,rax, and add r10,rax. Figure 14-2 
illustrates these calculations in greater detail. Note that this figure shows only the low-order 256 bits of each 
ZMM register and the low-order 8 bits of each opmask register. 

[Figure 14-2 lists the initial values of zmm16[255:0] and zmm17[255:0], followed by the register contents 
after each instruction of the macro: vpmovzxbd (zmm0), vpcmpud with CMP_GE (k1 = 0x7FF3), vpcmpud with 
CMP_LE (k2 = 0xEF3F), kandw (k3 = 0x6F33), vmovdqa32 with zero masking (zmm1), vpaddd (zmm16), 
vpmulld (zmm2), vpaddd (zmm17), and the kmovw, popcnt, and add instructions that update r10.] 


Figure 14-2. Calculations performed by instructions in macro _UpdateSums 
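
The masking steps performed by _UpdateSums can be sketched in portable scalar C++. This is only an illustration of the logic, not the book's code: the names BlockResult and UpdateSumsScalar are hypothetical, and the sketch collapses the 16 separate doubleword lane sums that the macro maintains into single totals.

```cpp
#include <cassert>
#include <cstdint>

struct BlockResult
{
    uint16_t k3;        // mask of in-range pixels (one bit per pixel)
    uint32_t sum;       // sum of in-range pixels
    uint32_t sum_sq;    // sum of squares of in-range pixels
    unsigned count;     // number of in-range pixels
};

// Scalar stand-in for one _UpdateSums expansion (16 pixels).
BlockResult UpdateSumsScalar(const uint8_t px[16], uint32_t val_min, uint32_t val_max)
{
    assert(val_min <= val_max);
    uint16_t k1 = 0, k2 = 0;

    for (int i = 0; i < 16; i++)
    {
        uint32_t v = px[i];                             // vpmovzxbd: byte -> dword
        if (v >= val_min) k1 |= uint16_t(1u << i);      // vpcmpud ...,CMP_GE
        if (v <= val_max) k2 |= uint16_t(1u << i);      // vpcmpud ...,CMP_LE
    }

    uint16_t k3 = uint16_t(k1 & k2);                    // kandw k3,k2,k1
    BlockResult r{ k3, 0, 0, 0 };

    for (int i = 0; i < 16; i++)
    {
        unsigned bit = (k3 >> i) & 1u;
        uint32_t v = bit ? px[i] : 0;                   // zero masking: out-of-range -> 0
        r.sum += v;                                     // vpaddd
        r.sum_sq += v * v;                              // vpmulld + vpaddd
        r.count += bit;                                 // popcnt of the mask
    }

    return r;
}
```

Because out-of-range elements are zeroed rather than skipped, they contribute nothing to either sum, and the population count of the mask gives the number of pixels that did contribute.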


Following the four _UpdateSums usages, the doubleword elements of registers ZMM16 and ZMM17 
contain packed copies of the values pixel_sum and pixel_sum_of_squares for the current block of 64 
pixels. The vextracti32x8 ymm0,zmm16,1 and vpaddd ymm1,ymm0,ymm16 instructions reduce the number 
of doubleword values in register ZMM16 from 16 to 8. The ensuing vpmovzxdq zmm2,ymm1 instruction 
promotes these doubleword values to quadwords, and the vpaddq zmm29,zmm29,zmm2 instruction updates 
the global packed quadword pixel_sum values that are maintained in register ZMM29. A similar sequence 
of instructions is then used to update the global packed quadword pixel_sum_of_squares values in register 
ZMM28. Following these instructions, the processing loop updates its pointer register and counters; it then 
repeats until the number of remaining pixels falls below 64. 
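
The two-level accumulation described above can be sketched in scalar C++ to show why doubleword lanes are sufficient within a block. This is an illustrative sketch under stated assumptions, not the book's code: the function name SumOfSquaresBlocked is hypothetical, the in-range test is omitted, and the input length is assumed to be a multiple of 64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums pixel squares using 32-bit per-block lane sums that are widened
// to 64 bits once per 64-pixel block.
uint64_t SumOfSquaresBlocked(const std::vector<uint8_t>& px)
{
    assert(px.size() % 64 == 0);
    uint64_t total[8] = {};                 // stands in for the qword lanes of ZMM28

    for (size_t base = 0; base < px.size(); base += 64)
    {
        uint32_t lane[16] = {};             // stands in for ZMM17, zeroed per block

        for (size_t j = 0; j < 64; j++)
        {
            uint32_t v = px[base + j];
            lane[j % 16] += v * v;          // at most 4 * 255 * 255 = 260100 per lane
        }

        // Fold 16 dword lanes to 8, then widen and add to the qword totals.
        for (int i = 0; i < 8; i++)
            total[i] += uint64_t(lane[i] + lane[i + 8]);
    }

    uint64_t sum = 0;
    for (uint64_t t : total)
        sum += t;                           // final horizontal reduction
    return sum;
}
```

Each doubleword lane receives only four squared pixels per block, so its worst-case value is far below the 32-bit limit; overflow protection comes from promoting to quadwords before the running totals grow.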

The block of code that starts at the label LB1 computes pixel_sum and pixel_sum_of_squares for 
the final few pixels (if any) using scalar integer arithmetic and the general-purpose registers. A series of 
extract (vextracti64x4, vextracti64x2, and vpextrq) and vpaddq instructions reduce the eight packed 
quadword pixel_sum values in ZMM29 to a single quadword value. A similar sequence of instructions is 
then used to calculate the final value for pixel_sum_of_squares. Note that these intermediate results are 
saved in the ImageStats structure pointed to by register RSI. The function Avx512CalcImageStats_ then 
executes a chain of vcvtusi2sd instructions to convert the intermediate results from unsigned quadword 
integers to double-precision floating-point. The final mean and standard deviation values are calculated 
using scalar double-precision floating-point arithmetic. The results for source code example Ch14_04 follow 
this paragraph. Table 14-1 shows benchmark timing measurements for the C++ and assembly language 
calculating functions Avx512CalcImageStatsCpp and Avx512CalcImageStats_. 


Results for Avx512CalcImageStats
image_fn:             ..\Ch14_Data\TestImage4.bmp
num_pixels:           258130
c_PixelValMin:        40
c_PixelValMax:        230

m_NumPixelsInRange:   229897       | 229897
m_PixelSum:           32574462     | 32574462
m_PixelSumOfSquares:  5139441032   | 5139441032
m_PixelMean:          141.691549   | 141.691549
m_PixelSd:            47.738056    | 47.738056

Running benchmark function Avx512CalcImageStats_BM - please wait
Benchmark times save to file Ch14_04_Avx512CalcImageStats_BM_CHROMIUM.csv


Table 14-1. Benchmark Timing Measurements for Image Statistics Calculating 
Functions Using TestImage4.bmp 


CPU         Avx512CalcImageStatsCpp    Avx512CalcImageStats_
i7-4790S    ----                       ----
i9-7900X    404                        29
i7-8700K    ----                       ----




RGB to Grayscale Conversion 


In Chapter 10, you learned how to use the AVX2 instruction set to convert an RGB image to a grayscale image 
(see example Ch10_06). Listing 14-5 shows the source code for example Ch14_05, which illustrates RGB to 
grayscale image conversion using the AVX-512 instruction set. 


Listing 14-5. Example Ch14_05 


#include "stdafx.h"
#include <iostream>
#include <stdexcept>
#include <iomanip>
#include "Ch14_05.h"
#include "ImageMatrix.h"
#include "AlignedMem.h"

using namespace std;

extern "C" const int c_NumPixelsMin = 64;
extern "C" const int c_NumPixelsMax = 16 * 1024 * 1024;

// RGB to grayscale conversion coefficients
const float c_Coef[3] {0.2126f, 0.7152f, 0.0722f};

bool CompareGsImages(const uint8_t* pb_gs1, const uint8_t* pb_gs2, int num_pixels)
{
    for (int i = 0; i < num_pixels; i++)
    {
        if (abs((int)pb_gs1[i] - (int)pb_gs2[i]) > 1)
            return false;
    }

    return true;
}

bool Avx512RgbToGsCpp(uint8_t* pb_gs, const uint8_t* const* pb_rgb, int num_pixels, const float coef[3])
{
    if (num_pixels < c_NumPixelsMin || num_pixels > c_NumPixelsMax)
        return false;
    if (num_pixels % 64 != 0)
        return false;
    if (!AlignedMem::IsAligned(pb_gs, 64))
        return false;

    const size_t align = 64;
    const uint8_t* pb_r = pb_rgb[0];
    const uint8_t* pb_g = pb_rgb[1];
    const uint8_t* pb_b = pb_rgb[2];

    if (!AlignedMem::IsAligned(pb_r, align))
        return false;
    if (!AlignedMem::IsAligned(pb_g, align))
        return false;
    if (!AlignedMem::IsAligned(pb_b, align))
        return false;

    for (int i = 0; i < num_pixels; i++)
    {
        uint8_t r = pb_r[i];
        uint8_t g = pb_g[i];
        uint8_t b = pb_b[i];

        float gs_temp = r * coef[0] + g * coef[1] + b * coef[2] + 0.5f;

        if (gs_temp < 0.0f)
            gs_temp = 0.0f;
        else if (gs_temp > 255.0f)
            gs_temp = 255.0f;

        pb_gs[i] = (uint8_t)gs_temp;
    }

    return true;
}

void Avx512RgbToGs(void)
{
    const wchar_t* fn_rgb = L"..\\Ch14_Data\\TestImage3.bmp";
    const wchar_t* fn_gs1 = L"Ch14_05_Avx512RgbToGs_TestImage3_GS1.bmp";
    const wchar_t* fn_gs2 = L"Ch14_05_Avx512RgbToGs_TestImage3_GS2.bmp";
    const wchar_t* fn_gs3 = L"Ch14_05_Avx512RgbToGs_TestImage3_GS3.bmp";

    ImageMatrix im_rgb(fn_rgb);
    int im_h = im_rgb.GetHeight();
    int im_w = im_rgb.GetWidth();
    int num_pixels = im_h * im_w;
    ImageMatrix im_r(im_h, im_w, PixelType::Gray8);
    ImageMatrix im_g(im_h, im_w, PixelType::Gray8);
    ImageMatrix im_b(im_h, im_w, PixelType::Gray8);
    RGB32* pb_rgb = im_rgb.GetPixelBuffer<RGB32>();
    uint8_t* pb_r = im_r.GetPixelBuffer<uint8_t>();
    uint8_t* pb_g = im_g.GetPixelBuffer<uint8_t>();
    uint8_t* pb_b = im_b.GetPixelBuffer<uint8_t>();
    uint8_t* pb_rgb_cp[3] {pb_r, pb_g, pb_b};

    for (int i = 0; i < num_pixels; i++)
    {
        pb_rgb_cp[0][i] = pb_rgb[i].m_R;
        pb_rgb_cp[1][i] = pb_rgb[i].m_G;
        pb_rgb_cp[2][i] = pb_rgb[i].m_B;
    }

    ImageMatrix im_gs1(im_h, im_w, PixelType::Gray8);
    ImageMatrix im_gs2(im_h, im_w, PixelType::Gray8);
    ImageMatrix im_gs3(im_h, im_w, PixelType::Gray8);
    uint8_t* pb_gs1 = im_gs1.GetPixelBuffer<uint8_t>();
    uint8_t* pb_gs2 = im_gs2.GetPixelBuffer<uint8_t>();
    uint8_t* pb_gs3 = im_gs3.GetPixelBuffer<uint8_t>();

    // Exercise conversion functions
    bool rc1 = Avx512RgbToGsCpp(pb_gs1, pb_rgb_cp, num_pixels, c_Coef);
    bool rc2 = Avx512RgbToGs_(pb_gs2, pb_rgb_cp, num_pixels, c_Coef);
    bool rc3 = Avx2RgbToGs_(pb_gs3, pb_rgb_cp, num_pixels, c_Coef);

    if (rc1 && rc2 && rc3)
    {
        im_gs1.SaveToBitmapFile(fn_gs1);
        im_gs2.SaveToBitmapFile(fn_gs2);
        im_gs3.SaveToBitmapFile(fn_gs3);

        bool c1 = CompareGsImages(pb_gs1, pb_gs2, num_pixels);
        bool c2 = CompareGsImages(pb_gs2, pb_gs3, num_pixels);

        if (c1 && c2)
            cout << "Grayscale image compare OK\n";
        else
            cout << "Grayscale image compare failed\n";
    }
    else
        cout << "Invalid return code\n";
}

int main()
{
    try
    {
        Avx512RgbToGs();
        Avx512RgbToGs_BM();
    }

    catch (runtime_error& rte)
    {
        cout << "'runtime_error' exception has occurred - " << rte.what() << '\n';
    }

    catch (...)
    {
        cout << "Unexpected exception has occurred\n";
    }

    return 0;
}

        include <MacrosX86-64-AVX.asmh>
        extern c_NumPixelsMin:dword
        extern c_NumPixelsMax:dword

            .const
r4_0p5      real4 0.5
r4_255p0    real4 255.0

; extern "C" bool Avx512RgbToGs_(uint8_t* pb_gs, const uint8_t* const* pb_rgb,
;   int num_pixels, const float coef[3]);

        .code
Avx512RgbToGs_ proc frame
        _CreateFrame RGBGS0_,0,96,r13,r14,r15
        _SaveXmmRegs xmm10,xmm11,xmm12,xmm13,xmm14,xmm15
        _EndProlog

        xor eax,eax                         ;error return code (also pixel buffer offset)

        cmp r8d,[c_NumPixelsMin]
        jl Done                             ;jump if num_pixels < min value
        cmp r8d,[c_NumPixelsMax]
        jg Done                             ;jump if num_pixels > max value
        test r8d,3fh
        jnz Done                            ;jump if (num_pixels % 64) != 0

        test rcx,3fh
        jnz Done                            ;jump if pb_gs is not aligned
        mov r13,[rdx]
        test r13,3fh
        jnz Done                            ;jump if pb_r is not aligned
        mov r14,[rdx+8]
        test r14,3fh
        jnz Done                            ;jump if pb_g is not aligned
        mov r15,[rdx+16]
        test r15,3fh
        jnz Done                            ;jump if pb_b is not aligned

; Perform required initializations
        vbroadcastss zmm10,real4 ptr [r9]       ;zmm10 = packed coef[0]
        vbroadcastss zmm11,real4 ptr [r9+4]     ;zmm11 = packed coef[1]
        vbroadcastss zmm12,real4 ptr [r9+8]     ;zmm12 = packed coef[2]
        vbroadcastss zmm13,real4 ptr [r4_0p5]   ;zmm13 = packed 0.5
        vbroadcastss zmm14,real4 ptr [r4_255p0] ;zmm14 = packed 255.0
        vxorps zmm15,zmm15,zmm15                ;zmm15 = packed 0.0
        mov r8d,r8d                             ;r8 = num_pixels
        mov r10,16                              ;r10 = number of pixels / iteration

; Load next block of pixels
        align 16
@@:     vpmovzxbd zmm0,xmmword ptr [r13+rax]    ;zmm0 = 16 pixels (r values)
        vpmovzxbd zmm1,xmmword ptr [r14+rax]    ;zmm1 = 16 pixels (g values)
        vpmovzxbd zmm2,xmmword ptr [r15+rax]    ;zmm2 = 16 pixels (b values)

; Convert dword values to SPFP and multiply by coefficients
        vcvtdq2ps zmm0,zmm0                 ;zmm0 = 16 pixels SPFP (r values)
        vcvtdq2ps zmm1,zmm1                 ;zmm1 = 16 pixels SPFP (g values)
        vcvtdq2ps zmm2,zmm2                 ;zmm2 = 16 pixels SPFP (b values)
        vmulps zmm0,zmm0,zmm10              ;zmm0 = r values * coef[0]
        vmulps zmm1,zmm1,zmm11              ;zmm1 = g values * coef[1]
        vmulps zmm2,zmm2,zmm12              ;zmm2 = b values * coef[2]

; Sum color components & clip values to [0.0, 255.0]
        vaddps zmm3,zmm0,zmm1               ;r + g
        vaddps zmm4,zmm3,zmm2               ;r + g + b
        vaddps zmm5,zmm4,zmm13              ;r + g + b + 0.5
        vminps zmm0,zmm5,zmm14              ;clip pixels above 255.0
        vmaxps zmm1,zmm0,zmm15              ;clip pixels below 0.0

; Convert grayscale values from SPFP to byte, save results
        vcvtps2dq zmm2,zmm1                 ;convert SPFP values to dwords
        vpmovusdb xmm3,zmm2                 ;convert to bytes
        vmovdqa xmmword ptr [rcx+rax],xmm3  ;save grayscale image pixels

        add rax,r10
        sub r8,r10
        jnz @B

        mov eax,1                           ;set success return code

Done:   vzeroupper
        _RestoreXmmRegs xmm10,xmm11,xmm12,xmm13,xmm14,xmm15
        _DeleteFrame r13,r14,r15
        ret
Avx512RgbToGs_ endp


; extern "C" bool Avx2RgbToGs_(uint8_t* pb_gs, const uint8_t* const* pb_rgb, int num_pixels,
;   const float coef[3]);

        .code
Avx2RgbToGs_ proc frame
        _CreateFrame RGBGS1_,0,96,r13,r14,r15
        _SaveXmmRegs xmm10,xmm11,xmm12,xmm13,xmm14,xmm15
        _EndProlog

        xor eax,eax                         ;error return code (also pixel buffer offset)

        cmp r8d,[c_NumPixelsMin]
        jl Done                             ;jump if num_pixels < min value
        cmp r8d,[c_NumPixelsMax]
        jg Done                             ;jump if num_pixels > max value
        test r8d,3fh
        jnz Done                            ;jump if (num_pixels % 64) != 0

        test rcx,3fh
        jnz Done                            ;jump if pb_gs is not aligned
        mov r13,[rdx]
        test r13,3fh
        jnz Done                            ;jump if pb_r is not aligned
        mov r14,[rdx+8]
        test r14,3fh
        jnz Done                            ;jump if pb_g is not aligned
        mov r15,[rdx+16]
        test r15,3fh
        jnz Done                            ;jump if pb_b is not aligned

; Perform required initializations
        vbroadcastss ymm10,real4 ptr [r9]       ;ymm10 = packed coef[0]
        vbroadcastss ymm11,real4 ptr [r9+4]     ;ymm11 = packed coef[1]
        vbroadcastss ymm12,real4 ptr [r9+8]     ;ymm12 = packed coef[2]
        vbroadcastss ymm13,real4 ptr [r4_0p5]   ;ymm13 = packed 0.5
        vbroadcastss ymm14,real4 ptr [r4_255p0] ;ymm14 = packed 255.0
        vxorps ymm15,ymm15,ymm15                ;ymm15 = packed 0.0
        mov r8d,r8d                             ;r8 = num_pixels
        mov r10,8                               ;r10 = number of pixels / iteration

; Load next block of pixels
        align 16
@@:     vpmovzxbd ymm0,qword ptr [r13+rax]  ;ymm0 = 8 pixels (r values)
        vpmovzxbd ymm1,qword ptr [r14+rax]  ;ymm1 = 8 pixels (g values)
        vpmovzxbd ymm2,qword ptr [r15+rax]  ;ymm2 = 8 pixels (b values)

; Convert dword values to SPFP and multiply by coefficients
        vcvtdq2ps ymm0,ymm0                 ;ymm0 = 8 pixels SPFP (r values)
        vcvtdq2ps ymm1,ymm1                 ;ymm1 = 8 pixels SPFP (g values)
        vcvtdq2ps ymm2,ymm2                 ;ymm2 = 8 pixels SPFP (b values)
        vmulps ymm0,ymm0,ymm10              ;ymm0 = r values * coef[0]
        vmulps ymm1,ymm1,ymm11              ;ymm1 = g values * coef[1]
        vmulps ymm2,ymm2,ymm12              ;ymm2 = b values * coef[2]

; Sum color components & clip values to [0.0, 255.0]
        vaddps ymm3,ymm0,ymm1               ;r + g
        vaddps ymm4,ymm3,ymm2               ;r + g + b
        vaddps ymm5,ymm4,ymm13              ;r + g + b + 0.5
        vminps ymm0,ymm5,ymm14              ;clip pixels above 255.0
        vmaxps ymm1,ymm0,ymm15              ;clip pixels below 0.0

; Convert grayscale components from SPFP to byte, save results
        vcvtps2dq ymm2,ymm1                 ;convert SPFP values to dwords
        vpackusdw ymm3,ymm2,ymm2
        vextracti128 xmm4,ymm3,1
        vpackuswb xmm5,xmm3,xmm4            ;byte GS pixels in xmm5[31:0] and xmm5[95:64]
        vpextrd r11d,xmm5,0                 ;r11d = 4 grayscale pixels
        mov dword ptr [rcx+rax],r11d        ;save grayscale image pixels
        vpextrd r11d,xmm5,2                 ;r11d = 4 grayscale pixels
        mov dword ptr [rcx+rax+4],r11d      ;save grayscale image pixels

        add rax,r10
        sub r8,r10
        jnz @B

        mov eax,1                           ;set success return code

Done:   vzeroupper
        _RestoreXmmRegs xmm10,xmm11,xmm12,xmm13,xmm14,xmm15
        _DeleteFrame r13,r14,r15
        ret
Avx2RgbToGs_ endp
        end


The algorithm that's used in this example to perform RGB to grayscale image conversion is the same 
one that was used in Ch10_06. As explained in Chapter 10, the algorithm uses a simple weighted average 
to transform an RGB image pixel into a grayscale image pixel. The C++ function Avx512RgbToGs begins its 
execution by loading the test image file. It then copies the RGB pixels of im_rgb into three separate color 
component image buffers. The reason for doing this is that this example's RGB to grayscale conversion 
functions require a structure of arrays (SOA) instead of an array of structures (AOS), which was employed in 
source code example Ch10_06. Following allocation of the grayscale image buffers, Avx512RgbToGs invokes 
the C++ and assembly language conversion functions. The resultant grayscale image buffers are then 
saved and compared; pixel values that differ by at most one are treated as equal. 

The assembly language code in Listing 14-5 includes two functions: Avx512RgbToGs_ and Avx2RgbToGs_. 
As implied by their respective name prefixes, these functions perform RGB to grayscale image conversions 
using AVX-512 and AVX2 instructions, respectively. The function Avx512RgbToGs_ begins its execution by 
validating num_pixels for size and divisibility by 64. It then checks the source and destination pixel buffers 
for proper alignment. The ensuing series of vbroadcastss instructions load packed versions of the color 
conversion coefficients into registers ZMM10, ZMM11, and ZMM12. This is followed by two more 
vbroadcastss instructions that broadcast the single-precision floating-point constants 0.5 and 255.0 to 
registers ZMM13 and ZMM14, and a vxorps instruction that sets ZMM15 to packed 0.0. The mov r8d,r8d 
instruction zero-extends num_pixels into R8, and the mov r10,16 instruction loads R10 with the number of 
pixels to process during each loop iteration. 

Each Avx512RgbToGs_ processing loop iteration starts with three vpmovzxbd instructions that load 
16 red, green, and blue pixel values into registers ZMM0, ZMM1, and ZMM2. The ensuing vcvtdq2ps 
instructions convert the doubleword pixel values to single-precision floating-point. The floating-point color 
values are then multiplied by the corresponding color coefficients using a series of vmulps instructions. 
These values are then summed using three vaddps instructions. The resultant 16 grayscale pixel values are 
then clipped to [0.0, 255.0] and converted to doubleword values. The vpmovusdb xmm3,zmm2 instruction 
size-reduces the doubleword values to bytes using unsigned saturation, and the vmovdqa xmmword ptr 
[rcx+rax],xmm3 instruction saves the 16 byte pixel values to the destination grayscale image buffer. 
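
The final conversion step differs slightly between the C++ and assembly language versions: the C++ cast truncates, whereas vcvtps2dq rounds according to the current MXCSR rounding mode (round-to-nearest by default). The sketch below is a hedged scalar illustration of that difference, which may be why CompareGsImages accepts results that differ by one. The function names are hypothetical and are not part of the book's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Weighted sum plus 0.5, clipped to [0.0, 255.0], as both versions compute it.
float ClippedGs(uint8_t r, uint8_t g, uint8_t b, const float coef[3])
{
    assert(coef != nullptr);
    float t = r * coef[0] + g * coef[1] + b * coef[2] + 0.5f;
    return std::min(std::max(t, 0.0f), 255.0f);
}

// C++ style narrowing: the cast discards the fractional part.
uint8_t NarrowTruncate(float t)
{
    return static_cast<uint8_t>(t);
}

// Stand-in for vcvtps2dq followed by an unsigned-saturating narrow:
// std::lrint rounds using the current rounding mode (nearest by default).
uint8_t NarrowNearest(float t)
{
    long v = std::lrint(t);
    return static_cast<uint8_t>(std::min(std::max(v, 0L), 255L));
}
```

For any clipped value the rounded result is either equal to the truncated result or exactly one greater, so a tolerance of one covers the two conversion styles.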
The assembly language function Avx2RgbToGs_ is identical to its AVX-512 counterpart except for two 
minor changes: Avx2RgbToGs_ uses AVX2 instructions and the YMM register set to carry out the required 
calculations, which means that it processes 8 pixels per loop iteration instead of 16; it also uses the 
vpackusdw and vpackuswb instructions in conjunction with a few other instructions to perform the 
doubleword to byte size reductions. The reason for this is that AVX2 does not 
support the vpmovusdb instruction. Here is the output for source code example Ch14_05: 


Grayscale image compare OK

Running benchmark function Avx512RgbToGs_BM - please wait
Benchmark times save to file Ch14_05_Avx512RgbToGs_BM_CHROMIUM.csv
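
Returning to the AVX2 size reduction in Avx2RgbToGs_: the pack instructions operate independently on each 128-bit lane, which is why the eight result bytes end up in xmm5[31:0] and xmm5[95:64]. The following scalar C++ emulation is a sketch of that data movement under the assumption that both source operands of vpackusdw are the same register, as in the listing. All function and type names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

using Words8 = std::array<uint16_t, 8>;

static uint16_t SatU16(int32_t v) { return uint16_t(std::min(std::max(v, 0), 65535)); }
static uint8_t SatU8(int16_t v) { return uint8_t(std::min<int>(std::max<int>(v, 0), 255)); }

// vpackusdw ymm3,ymm2,ymm2: each 128-bit lane packs its own four dwords twice.
std::array<Words8, 2> PackUsdwSameSrc(const std::array<int32_t, 8>& d)
{
    std::array<Words8, 2> lanes{};
    for (int lane = 0; lane < 2; lane++)
        for (int i = 0; i < 4; i++)
        {
            uint16_t w = SatU16(d[lane * 4 + i]);
            lanes[lane][i] = w;         // words from the first source operand
            lanes[lane][i + 4] = w;     // words from the second (same) operand
        }
    return lanes;
}

// vpackuswb xmm5,xmm3,xmm4: signed words to unsigned-saturated bytes.
std::array<uint8_t, 16> PackUswb(const Words8& a, const Words8& b)
{
    std::array<uint8_t, 16> out{};
    for (int i = 0; i < 8; i++)
    {
        out[i] = SatU8(int16_t(a[i]));
        out[i + 8] = SatU8(int16_t(b[i]));
    }
    return out;
}

// Full sequence: pack, extract the high lane, pack again, pull out dwords 0 and 2.
std::array<uint8_t, 8> NarrowEightDwords(const std::array<int32_t, 8>& d)
{
    auto lanes = PackUsdwSameSrc(d);            // lanes[0] = xmm3, lanes[1] = xmm4
    auto x5 = PackUswb(lanes[0], lanes[1]);     // xmm5
    std::array<uint8_t, 8> gs{};
    std::copy(x5.begin(), x5.begin() + 4, gs.begin());              // vpextrd ...,0
    std::copy(x5.begin() + 8, x5.begin() + 12, gs.begin() + 4);     // vpextrd ...,2
    return gs;
}
```

Bytes 4 to 7 and 12 to 15 of the emulated xmm5 hold duplicates of the neighboring results, which is why the listing extracts only doubleword elements 0 and 2.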


Table 14-2 shows the benchmark timing measurements for source code example Ch14_05. 


Table 14-2. Mean Execution Times (Microseconds) for RGB to Grayscale Image Conversion 
Using TestImage3.bmp 


CPU         Avx512RgbToGsCpp    Avx512RgbToGs_    Avx2RgbToGs_
i7-4790S    ----                ----              ----
i9-7900X    1125                134               259
i7-8700K    ----                ----              ----


The benchmark time differences between the AVX-512 and AVX2 implementations of the RGB to grayscale 
conversion algorithm are consistent with what one might expect. It is interesting to compare these numbers 
with the benchmark timing measurements from source code example Ch10_06 (see Table 10-2). This earlier 
example used an array of RGB32 pixels (or AOS) for the source image buffer, and the mean execution time for 
the conversion function Avx2ConvertRgbToGs_ was 593 microseconds. The current example exploits separate 
image pixel buffers for each color component (or SOA), which significantly improves performance. 


Summary 


Here are the key learning points for Chapter 14: 


• Assembly language functions can use AVX-512 promoted versions of most AVX and
  AVX2 packed integer instructions to perform operations using 512-, 256-, and 128-bit
  wide operands.

• Assembly language functions can use the vmovdqa[32|64] and
  vmovdqu[8|16|32|64] instructions to perform aligned and unaligned moves of packed
  integer operands.

• Assembly language functions can use the vpmovus[qd|qw|qb|dw|db|wb] instructions
  to carry out packed integer size reductions using unsigned saturation. AVX-512 also
  supports an analogous set of packed integer size-reducing instructions using signed
  saturation.

• The vpcmpu[b|w|d|q] instructions perform packed unsigned integer compare
  operations and save the resultant compare mask to an opmask register.

• The vpand[d|q], vpandn[d|q], vpor[d|q], and vpxor[d|q] instructions can be used
  with an opmask register to perform merge or zero masking using doubleword or
  quadword elements.

• The vextracti[32x4|32x8|64x2|64x4] instructions can be used to extract packed
  doubleword or quadword values from a packed integer operand.

• When performing SIMD calculations using packed integer or floating-point
  operands, a structure-of-arrays construct is often significantly faster than an
  array-of-structures construct.
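
The distinction between merge masking and zero masking can be modeled with a short scalar C++ sketch. The type Zmm32 and the functions AndMerge and AndZero are hypothetical names introduced for illustration; they model the element-wise behavior of a masked vpandd, not its performance.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Models the 16 doubleword elements of a 512-bit operand.
using Zmm32 = std::array<uint32_t, 16>;

// Merge masking: elements whose opmask bit is 0 keep the
// destination operand's previous value.
Zmm32 AndMerge(const Zmm32& des, const Zmm32& a, const Zmm32& b, uint16_t k)
{
    Zmm32 r = des;
    for (size_t i = 0; i < r.size(); i++)
    {
        if (k & (1u << i))
            r[i] = a[i] & b[i];
    }
    return r;
}

// Zero masking: elements whose opmask bit is 0 are set to zero.
Zmm32 AndZero(const Zmm32& a, const Zmm32& b, uint16_t k)
{
    Zmm32 r{};
    for (size_t i = 0; i < r.size(); i++)
    {
        if (k & (1u << i))
            r[i] = a[i] & b[i];
    }
    return r;
}
```

In both cases bit i of the opmask selects whether element i receives the result of the Boolean operation; the two forms differ only in what happens to the unselected elements.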


CHAPTER 15 


Optimization Strategies and 
Techniques 


In the preceding chapters, you learned the fundamentals of x86-64 assembly language programming. You 
also learned how to use the computational resources of Advanced Vector Extensions to perform SIMD
operations. To maximize the performance of your x86 assembly language code, it is often necessary to 
understand important details about the inner workings of an x86 processor. In this chapter, you'll explore the 
internal hardware components of a modern x86 multi-core processor and its underlying microarchitecture. 
You'll also learn how to apply specific coding strategies and techniques to boost the performance of your 
x86-64 assembly language code. 

The content of Chapter 15 should be regarded as an introductory tutorial of its topics. A comprehensive 
examination of x86 microarchitectures and assembly language optimization techniques would minimally 
require several lengthy chapters, or conceivably an entire book. The primary reference source for this 
chapter's material is the Intel 64 and IA-32 Architectures Optimization Reference Manual. You are encouraged 
to consult this important reference guide for additional information regarding Intel's x86 microarchitectures 
and assembly language optimization techniques. The AMD manual Software Optimization Guide for AMD 
Family 17h Processors also contains useful optimization guidance for x86 assembly language programmers. 
Appendix A includes additional references that contain more information regarding x86 assembly language 
optimization strategies and techniques. 


Processor Microarchitecture 


The performance capabilities of an x86 processor are principally determined by its underlying 
microarchitecture. A processor's microarchitecture is characterized by the organization and operation of 
the following internal hardware components: instruction pipelines, decoders, schedulers, Execution Units, 
data buses, and caches. Software developers who understand the basics of processor microarchitectures can 
often glean constructive insights that enable them to develop more efficient code. 

This section explains processor microarchitecture concepts using Intel's Skylake microarchitecture 
as an illustrative example. The Skylake microarchitecture is utilized in recent mainstream processors 
from Intel, including sixth generation Core i3, i5, and i7 series CPUs. Seventh (Kaby Lake) and eighth 
(Coffee Lake) generation Core series CPUs are also based on the Skylake microarchitecture. The 
structural organization and operation of earlier Intel microarchitectures such as Sandy Bridge and 
Haswell are comparable to Skylake. Most of the concepts presented in this section are also applicable 
to microarchitectures developed and used by AMD in its processors, although the underlying hardware 
implementations vary. Before proceeding it should be noted that the Skylake microarchitecture discussed in 
this section is similar to but different from the Skylake Server microarchitecture that was referenced in the 
chapters that described AVX-512 concepts and programming.


Processor Architecture Overview


The architectural details of a processor based on Skylake or any other modern microarchitecture are best 
examined using the framework of a multi-core processor. Figure 15-1 shows a simplified block diagram 
of a representative Skylake-based quad-core processor. Note that each CPU core includes first-level (L1)
instruction and data caches, which are labeled I-Cache and D-Cache. As implied by their names, these 
memory caches contain instructions and data that a CPU core can access rapidly. Each CPU core also 
includes a second-level (L2) unified cache, which holds both instructions and data. Besides improving 
performance, the L1 and L2 caches enable the CPU cores to execute independent instruction streams in 
parallel without having to access the higher-level L3 shared cache or main memory. 


[Figure: block diagram; labeled blocks include the L3 Shared Cache (size varies)]

Figure 15-1. Simplified block diagram of a Skylake-based quad-core processor


If a CPU core requires an instruction or data item that is not present in its L1 or L2 cache, it must be
loaded from the L3 cache or main memory. A processor's L3 cache is partitioned into multiple "slices." Each
slice consists of a logic controller and data array. The logic controller manages access to its corresponding 
data array. It also handles cache misses and writes to main memory. A cache miss occurs when requested 
data is not present in the L3 cache and must be loaded from main memory (cache misses also occur when 
data is not available in the L1 or L2 caches). Each L3 data array includes cache memory, which is organized 
into 64-byte wide packets called cache lines. The Ring Interconnect is a high-speed internal bus that 
facilitates data transfers between the CPU cores, L3 cache, graphics unit, and System Agent. The System 
Agent handles data traffic among the processor, its external data buses, and main memory. 


Microarchitecture Pipeline Functionality 


During program execution, a CPU core performs five elementary instructional operations: fetch, decode, 
dispatch, execute, and retire. The particulars of these operations are determined by the functionality of the 
microarchitecture's pipeline. Figure 15-2 shows a streamlined block diagram of pipeline functionality in a 
Skylake-based CPU core. In the paragraphs that follow, the operations performed by these pipeline units are 
examined in greater detail. 


[Figure: pipeline block diagram. Labeled blocks: Instruction Fetch and Pre-Decode, Instruction Queue,
Instruction Decoders, Decoded Instruction Cache (Micro-Ops), 256K L2 Unified Cache, Branch Prediction
Unit, Micro-Op Instruction Queue, Loop Stream Detector, Load Buffers, Store Buffers, Reorder Buffers,
Allocate/Rename, Retire Unit, Scheduler, and Execution Engine, with In-Order and Out-of-Order regions]

Figure 15-2. Skylake CPU core pipeline functionality


The Instruction Fetch and Pre-Decode Unit grabs instructions from the L1 I-Cache and begins the 
process of preparing them for execution. Steps performed during this stage include instruction length 
resolution, decoding of x86 instructional prefixes, and property marking to assist the downstream decoders. 
The Instruction Fetch and Pre-Decode Unit is also responsible for feeding a constant stream of instructions 
to the Instruction Queue, which queues up instructions for presentation to the Instruction Decoders. 

The Instruction Decoders translate x86 instructions into micro-ops. A micro-op is a self-contained 
low-level instruction that is ultimately executed by one of the Execution Engine's Execution Units, which are 
discussed in the next section. The number of micro-ops generated by the decoders for an x86 instruction 
varies depending on its complexity. Simple register-register instructions such as add eax,edx and vpxor
xmm0,xmm0,xmm0 are decoded into a single micro-op. Instructions that perform more complex operations,
such as idiv rcx and vdivpd ymm0,ymm1,ymm2, require multiple micro-ops. The translation of x86
instructions into micro-ops facilitates several architectural and performance benefits, including
instruction-level parallelism and out-of-order execution.

The Instruction Decoders also perform two ancillary operations that improve utilization of available 
pipeline bandwidth. The first of these operations is called micro-fusion, which combines simple micro-ops from 
the same x86 instruction into a single complex micro-op. Examples of micro-fused instructions include memory 
stores (mov dword ptr [rbx+16],eax) and calculating instructions that reference operands in memory (sub 
r9,qword ptr [rbp+48]). Fused complex micro-ops are dispatched by the Execution Engine multiple times 
(each dispatch executes a simple micro-op from the original instruction). The second ancillary operation carried 
out by the Instruction Decoders is called macro-fusion. Macro-fusion combines certain commonly-used x86 
instruction pairs into a single micro-op. Examples of macro-fusible instruction pairs include many (but not all) 
conditional jump instructions that are preceded by an add, and, cmp, dec, inc, sub, or test instruction. 

Micro-ops from the Instruction Decoders are transferred to the Micro-Op Instruction Queue for eventual 
dispatch by the Scheduler. They're also cached, when necessary, in the Decoded Instruction Cache. The Micro- 
Op Instruction Queue is also used by the Loop Stream Detector, which identifies and locks small program loops 
in the Micro-Op Instruction Queue. This improves performance since a small loop can repeatedly execute 
without requiring any additional instruction fetch, decode, and micro-op cache read operations. 

The Allocate/Rename block serves as a bridge between the in-order front-end pipelines and the 
out-of-order Scheduler and Execution Engine. It allocates any needed internal buffers to the micro-ops.
It also eliminates false dependencies between micro-ops, which facilitates out-of-order execution. A false
dependency occurs when two micro-ops need to simultaneously access distinct versions of the same 
hardware resource. (In assembly language code, false dependencies can occur when using instructions that 
update only the low-order 8 or 16 bits of a 32-bit register.) Micro-ops are then transferred to the Scheduler. 
This unit queues micro-ops until the necessary source operands are available. It then dispatches ready- 
to-execute micro-ops to the appropriate Execution Unit in the Execution Engine. The Retire Unit removes 
micro-ops that have completed their execution using the program's original instruction-ordering pattern. It 
also signals any processor exceptions that may have occurred during micro-op execution. 

Finally, the Branch Prediction Unit helps select the next set of instructions to execute by predicting the 
branch targets that are most likely to execute based on recent code execution patterns. A branch target is 
simply the destination operand of a control transfer instruction, such as jcc, jmp, call, or ret. The Branch
Prediction Unit enables a CPU core to speculatively execute the micro-ops of an instruction before the 
outcome of a branch decision is known. When necessary, a CPU core searches (in order) the Decoded 
Instruction Cache, L1 I-Cache, L2 Unified Cache, L3 Cache, and main memory for instructions to execute. 


Execution Engine 


The Execution Engine executes micro-ops passed to it by the Scheduler. Figure 15-3 shows a high-level block
diagram of a Skylake-based CPU core Execution Engine. The rectangular blocks beneath each dispatch
port represent distinct micro-op Execution Units. Note that four of the Scheduler ports facilitate access to
Execution Units that carry out calculating functions including integer, floating-point, and SIMD arithmetic.
The remaining four ports support memory load and store operations.


Figure 15-3. Skylake CPU core Execution Engine and its Execution Units


Each Execution Unit performs a specific calculation or operation. For example, the Integer ALU 
(Arithmetic Logic Unit) Execution Units carry out integer addition, subtraction, and compare operations. 
The Vector ALU Execution Units handle SIMD integer arithmetic and bitwise Boolean operations. Note 
that the Execution Engine contains multiple instances of select Execution Units. This allows the Execution
Engine to execute multiple instances of certain micro-ops in parallel. For example, the Execution Engine
can perform three separate SIMD bitwise Boolean operations concurrently using the Vector ALU
Execution Units.

Each Skylake core Scheduler can dispatch a maximum of eight micro-ops per cycle (one per port) to 
the Execution Engine. The out-of-order engine, which includes the Scheduler, Execution Engine, and Retire 
Unit, supports up to 224 "in-flight" (or coexistent) micro-ops. Table 15-1 shows key buffer sizes for recent 
Intel microarchitectures. 


Table 15-1. Comparison of Key Buffer Sizes for Recent Intel Microarchitectures


Parameter                      Sandy Bridge (2nd Gen)    Haswell (4th Gen)    Skylake (6th Gen)
Dispatch ports                 6                         8                    8
In-flight micro-ops            168                       192                  224
In-flight loads                64                        72                   72
In-flight stores               36                        42                   56
Scheduler entries              54                        60                   97
Integer register file          160                       168                  180
Floating-point register file   144                       168                  168


Optimizing Assembly Language Code 


This section discusses some basic optimization strategies and techniques that you can use to improve the 
performance of your x86 64-bit assembly language code. These techniques are recommended for use in 
code that targets recent Intel microarchitectures, including Skylake Server, Skylake, Haswell, and Sandy 
Bridge. Most of these techniques are also appropriate for use in code that will execute on recent AMD processors.
The optimization strategies and techniques are organized into five generic categories: 


• Basic techniques
• Floating-point arithmetic
• Program branches
• Data alignment
• SIMD techniques


It is important to keep in mind that the optimization techniques mentioned in this section must 
be applied in a prudent manner. For example, it makes little sense to add extra push and pop (or other) 
instructions to a function just to use a recommended instruction form only once. Moreover, none of the
optimization strategies and techniques described in this section will remedy an inappropriate or poorly
designed algorithm. The Intel 64 and IA-32 Architectures Optimization Reference Manual contains additional
information regarding the optimization strategies and techniques discussed in this section. Appendix A also
contains additional references that you can consult for more information regarding optimization of x86 
assembly language code. 


Basic Techniques 


The following coding strategies and techniques are frequently employed to improve the performance of
x86-64 assembly language code.


• Use a test instruction instead of a cmp instruction when possible, especially to carry
  out a simple less than, equal to, or greater than zero test.

• Avoid using the memory-immediate forms of the cmp and test instructions (e.g.,
  cmp dword ptr [rbp+40],100 or test byte ptr [r12],0fh). Instead, load the
  memory value into a register and use the register-immediate form of the cmp or test
  instruction (e.g., mov eax,dword ptr [rbp+40] followed by cmp eax,100).

• Minimize use of instructions that perform partial updates of the status flags in
  RFLAGS. For example, the instructions add eax,1 or sub rax,1 may be faster than
  inc eax or dec rax, especially in performance-critical loops (the inc and dec
  instructions do not update RFLAGS.CF).

• Use an xor or sub instruction to zero a register instead of a mov instruction. For
  example, use an xor eax,eax or sub eax,eax instruction instead of mov eax,0. The
  mov instruction form can be used when it's necessary to avoid modifying the status
  flags in RFLAGS.

• Avoid using instructions that require an operand-size prefix to load a 16-bit
  immediate value since instructions with operand-size prefixes take longer to decode.
  Use an equivalent 32-bit immediate value instead. For example, use mov edx,42
  instead of mov dx,42.

• Use 32-bit instead of 64-bit instruction forms and general-purpose registers when
  possible. For example, if the maximum number of for-loop iterations does not
  exceed the range limits of a 32-bit integer, use a 32-bit instead of a 64-bit
  general-purpose register for the loop counter.

• Use 32-bit instruction forms to load 64-bit registers with positive constant values. For
  example, the instructions mov eax,16 and mov r8d,42 effectively set RAX to 16 and
  R8 to 42.

• Use the two- or three-operand form of the imul instruction to multiply two signed
  integers when the full-width product is not needed. For example, use imul rax,rcx
  when a 64-bit truncated product is sufficient instead of imul rcx, which returns a
  128-bit product in RDX:RAX. This guideline also applies to 32-bit signed integer
  multiplication.

• Avoid declaring data values inside a code section. In situations where it's necessary
  to do this (e.g., when defining a read-only jump table), position the data after an
  unconditional jmp or ret instruction.

• In performance-critical processing loops, minimize use of lea instructions that
  contain three effective address components (e.g., base register, index register,
  and displacement). These instructions can only be dispatched to the Slow LEA
  Execution Unit through Port 1. Shorter forms (one or two effective address
  components) of the lea instruction can be dispatched via ports 1 or 5 to one of the
  Fast LEA Execution Units.

• Load any memory values that are needed for multiple calculations into a register. If a
  memory value is needed only for a single calculation, use the register-memory form
  of the calculating instruction. Table 15-2 shows several examples.


Table 15-2. Instruction Form Examples for Single and Multiple-Use Memory Values


Register-Memory Form                      Move and Register-Register Form
(Single-Use Data)                         (Multiple-Use Data)

add edx,dword ptr [x]                     mov eax,dword ptr [x]
                                          add edx,eax

and rax,qword ptr [rbx+16]                mov rcx,[rbx+16]
                                          and rax,rcx

cmp ecx,dword ptr [n]                     mov eax,dword ptr [n]
                                          cmp ecx,eax

vmulpd xmm0,xmm2,xmmword ptr [rdx]        vmovapd xmm1,xmmword ptr [rdx]
                                          vmulpd xmm0,xmm2,xmm1
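

To see why a test instruction can replace cmp for a comparison against zero, it may help to model the status flags in C++. The sketch below is an illustration only; the names Flags, TestSelf, and SignFromFlags are hypothetical. It models the ZF and SF results of test eax,eax, which performs a bitwise AND of the register with itself and discards the result.

```cpp
#include <cassert>
#include <cstdint>

// The two status flags of interest after "test eax,eax".
struct Flags
{
    bool zf;    // zero flag: result == 0
    bool sf;    // sign flag: most significant bit of result
};

// Models "test eax,eax": AND the value with itself and keep only the flags.
inline Flags TestSelf(int32_t x)
{
    uint32_t r = static_cast<uint32_t>(x) & static_cast<uint32_t>(x);
    return { r == 0, (r >> 31) != 0 };
}

// Classifies the value using the flags alone: -1, 0, or +1.
// This mirrors the decisions that jz and js would make after the test.
inline int SignFromFlags(Flags f)
{
    assert(!(f.zf && f.sf));    // a zero result never has its sign bit set
    if (f.zf)
        return 0;               // equal to zero (jz)
    return f.sf ? -1 : 1;       // less than zero (js), otherwise greater
}
```

Since x AND x equals x, the zero and sign flags describe x itself, which is all that a less than, equal to, or greater than zero decision requires.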


Floating-Point Arithmetic 


The following coding strategies and techniques can be employed to improve the performance of x86-64
assembly language code that performs floating-point operations. These guidelines apply to both scalar and
packed floating-point calculations.


• Always use the computational resources of x86-AVX to perform scalar floating-point
  arithmetic. Do not use the legacy x87 floating-point unit to perform these types of
  calculations.

• Use single-precision floating-point values instead of double-precision values
  whenever possible.

• Arrange floating-point instruction sequences to minimize register dependencies.
  Exploit multiple destination registers to save intermediate results, then reduce the
  intermediate results to a single value (see example Ch11_01).

• Partially (or completely) unroll processing loops that contain floating-point
  calculations, especially loops that contain sequences of floating-point addition,
  multiplication, or FMA operations.

• Avoid arithmetic underflows and denormal values during arithmetic calculations
  whenever possible.

• Avoid using denormalized floating-point constants.

• If excessive arithmetic underflows are expected, consider enabling the flush-to-zero
  (MXCSR.FTZ) and denormals-are-zero (MXCSR.DAZ) modes. See Chapter 4 for
  more information regarding the proper use of these modes.
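

The guideline about multiple destination registers and loop unrolling can be illustrated with a C++ sketch. The function names SumChain and SumUnrolled4 are hypothetical, and the unroll factor of four is an arbitrary choice for illustration rather than a recommendation from the book.

```cpp
#include <cassert>
#include <cstddef>

// Single accumulator: every addition must wait for the previous one,
// which forms one long dependency chain.
double SumChain(const double* x, size_t n)
{
    assert(x != nullptr || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

// Four independent accumulators: the four additions in each iteration
// do not depend on one another, so they can overlap in the pipeline.
double SumUnrolled4(const double* x, size_t n)
{
    assert(x != nullptr || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
    {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }

    for (; i < n; i++)          // remaining elements
        s0 += x[i];

    return (s0 + s1) + (s2 + s3);   // reduce intermediate results
}
```

Note that the unrolled version adds the elements in a different order, so for general floating-point data its result can differ from the single-accumulator result by a small rounding error.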


Program Branches 


Program branch instructions, such as jmp, call, and ret, are potentially time-consuming operations to
perform since they can affect the contents of the front-end pipelines and internal caches. The conditional
jump instruction jcc is also a performance concern given its frequency of use. The following optimization
techniques can be employed to minimize the adverse performance effects of branch instructions and
improve the accuracy of the Branch Prediction Unit:


• Organize code to minimize the number of possible branch instructions.

• Partially (or completely) unroll short processing loops to minimize the number of
  executed conditional jump instructions. Avoid excessive loop unrolling since this
  may result in slower executing code due to less efficient use of the Loop Stream
  Detector (see Figure 15-2).

• Eliminate unpredictable data-dependent branches using the setcc or cmovcc
  instructions.

• Align branch targets in performance-critical loops to 16-byte boundaries.

• Move conditional code that is unlikely to execute (e.g., error-handling code) to
  another program (or .code) section or memory page.
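

The idea behind replacing data-dependent branches with setcc or cmovcc can be sketched in C++. The function names below are hypothetical. A compiler will often translate the branch-free forms into setcc or cmovcc instructions, but this depends on the compiler and its optimization settings and is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Branchy form: one data-dependent conditional jump per element.
size_t CountBelowBranchy(const int32_t* x, size_t n, int32_t thresh)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
    {
        if (x[i] < thresh)
            count++;
    }
    return count;
}

// Branch-free form: the compare result (0 or 1) is added directly,
// which corresponds to the setcc pattern in assembly language.
size_t CountBelowBranchless(const int32_t* x, size_t n, int32_t thresh)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += static_cast<size_t>(x[i] < thresh);
    return count;
}

// Conditional select: corresponds to the cmovcc pattern.
inline int32_t ClampMin(int32_t v, int32_t lo)
{
    return (v < lo) ? lo : v;
}
```

The only remaining conditional jump in CountBelowBranchless is the loop-terminating one, which is highly predictable.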


The Branch Prediction Unit employs both static and dynamic techniques to predict the target of a 
jump instruction. Incorrect branch predictions can be minimized if blocks of code containing conditional 
jump instructions are arranged such that they're consistent with the Branch Prediction Unit's static 
prediction algorithm: 


• Use forward conditional jumps when the fall-through code is more likely to execute.
• Use backward conditional jumps when the fall-through code is less likely to execute.


The forward conditional jump approach is frequently used in blocks of code that perform function
argument validation. The backward conditional jump technique is often employed at the bottom of a
processing loop code block following a counter update or other loop-terminating test decision. Listing 15-1
contains a short assembly language function that illustrates these practices in greater detail.


Listing 15-1. Example Ch15_01


            .const
r8_2p0      real8 2.0

; extern "C" int CalcResult_(double* y, const double* x, size_t n);

            .code
CalcResult_ proc

; Forward conditional jumps are used in this code block since
; the fall-through cases are more likely to occur

            test r8,r8
            jz Done                             ;jump if n == 0
            test r8,7h
            jnz Error                           ;jump if (n % 8) != 0
            test rcx,1fh
            jnz Error                           ;jump if y is not aligned to a 32b boundary
            test rdx,1fh
            jnz Error                           ;jump if x is not aligned to a 32b boundary

; Initialize
            xor eax,eax                         ;set array offset to zero
            vbroadcastsd ymm5,real8 ptr [r8_2p0]    ;packed 2.0

; Simple array processing loop
            align 16
@@:         vmovapd ymm0,ymmword ptr [rdx+rax]      ;load x[i+3]:x[i]
            vdivpd ymm1,ymm0,ymm5
            vsqrtpd ymm2,ymm1
            vmovapd ymmword ptr [rcx+rax],ymm2      ;save y[i+3]:y[i]

            vmovapd ymm0,ymmword ptr [rdx+rax+32]   ;load x[i+7]:x[i+4]
            vdivpd ymm1,ymm0,ymm5
            vsqrtpd ymm2,ymm1
            vmovapd ymmword ptr [rcx+rax+32],ymm2   ;save y[i+7]:y[i+4]

; A backward conditional jump is used in this code block since
; the fall-through case is less likely to occur

            add rax,64
            sub r8,8
            jnz @B

Done:       xor eax,eax                         ;set success return code
            vzeroupper
            ret

; Error handling code that's unlikely to execute

Error:      mov eax,1                           ;set error return code
            ret

CalcResult_ endp
            end
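

For comparison, the logic of Listing 15-1 can be expressed as a scalar C++ function. The name CalcResultCpp is hypothetical and is not part of the book's source code; the sketch assumes the same return codes as the assembly language version (0 for success, 1 for an argument error) and performs the same argument checks in the same order.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Scalar counterpart of the assembly language function:
// y[i] = sqrt(x[i] / 2.0) for each element.
int CalcResultCpp(double* y, const double* x, size_t n)
{
    // Argument validation: the common (valid) case falls through.
    if (n == 0)
        return 0;                                       // nothing to do
    if ((n % 8) != 0)
        return 1;                                       // n must be a multiple of 8
    if ((reinterpret_cast<uintptr_t>(y) & 0x1f) != 0)
        return 1;                                       // y not on a 32-byte boundary
    if ((reinterpret_cast<uintptr_t>(x) & 0x1f) != 0)
        return 1;                                       // x not on a 32-byte boundary

    // Processing loop: the loop-closing branch jumps backward
    // and is taken on every iteration except the last.
    for (size_t i = 0; i < n; i++)
        y[i] = std::sqrt(x[i] / 2.0);

    return 0;
}
```

The validation tests correspond to the forward conditional jumps in the listing, while the loop-closing branch corresponds to the backward jnz instruction.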


Data Alignment 


It's been mentioned (perhaps excessively) multiple times in this book, but the importance of using properly
aligned data cannot be overemphasized. Programs that manipulate improperly aligned data are likely
to trigger the processor into performing additional memory cycles and micro-op executions, which can
adversely affect overall system performance. The following data alignment practices should be considered
universal truths and always observed:


• Align multi-byte integer and floating-point values to their natural boundaries.

• Align 128-, 256-, and 512-bit wide packed integer and floating-point values to their
  proper boundaries.

• Pad data structures if necessary to ensure proper alignment of each structure
  member.

• Use the appropriate C++ language specifiers and library functions to align data items
  that are allocated in high-level code. Visual C++ functions can use the alignas(n)
  specifier or call _aligned_malloc to properly align data items.

• Give preference to aligned stores over aligned loads when it is not possible to
  align both.


The following data arrangement techniques are also recommended:


• Align and position small arrays and short text strings in a data structure to avoid cache
  line splits. A cache line split occurs when the bytes of a multi-byte value are split
  across a 64-byte boundary. Positioning small multi-byte values on the same cache
  line helps minimize the number of memory cycles that the processor must perform.

• Evaluate the performance effects of different data layouts such as structure of arrays
  versus array of structures.
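

The alignment and cache line split rules above can be checked with a few lines of C++. The following sketch is illustrative only: the helper names IsAligned and SplitsCacheLine and the structure Sample are hypothetical, and the structure layout comments assume a typical x86-64 compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t kCacheLineSize = 64;   // cache line width cited in this chapter

// Returns true if p is a multiple of alignment (which must be a power of two).
inline bool IsAligned(const void* p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<uintptr_t>(p) & (alignment - 1)) == 0;
}

// Returns true if the size bytes starting at addr straddle a 64-byte boundary.
inline bool SplitsCacheLine(uintptr_t addr, size_t size)
{
    assert(size > 0);
    return (addr / kCacheLineSize) != ((addr + size - 1) / kCacheLineSize);
}

// Hypothetical structure: the compiler pads it so that each member
// sits on its required boundary.
struct Sample
{
    uint8_t Tag;                // one byte, followed by padding
    double Value;               // naturally aligned to 8 bytes
    alignas(32) float Vec[8];   // 256-bit packed operand on a 32-byte boundary
};

static_assert(alignof(Sample) == 32, "Sample takes on its strictest member alignment");
static_assert(offsetof(Sample, Vec) % 32 == 0, "Vec starts on a 32-byte boundary");
static_assert(offsetof(Sample, Value) % alignof(double) == 0, "Value is naturally aligned");
```

For example, an 8-byte value at offset 60 of a cache line spans bytes 60 through 67 and therefore splits across two lines, whereas the same value at offset 56 does not.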


SIMD Techniques 


The following techniques should be observed, when appropriate, by any function that performs SIMD 
computations using AVX, AVX2, or AVX-512 instructions. 


• Do not code functions that intermix x86-AVX and x86-SSE instructions. It is okay to
  code functions that intermix AVX, AVX2, and AVX-512 instructions.

• Minimize register dependencies to exploit multiple Execution Units in the Execution
  Engine.

• Load multiple-use memory operands and packed constants into a register.

• On systems that support AVX-512, exploit the extra SIMD registers to minimize
  data dependencies and register spills. A register spill occurs when a function must
  temporarily save the contents of a register to memory in order to free the register for
  other calculations.

• Use a vpxor, vxorp[d|s], etc. instruction to zero a register instead of a data
  move instruction. For example, vxorps xmm0,xmm0,xmm0 is preferred over vmovaps
  xmm0,xmmword ptr [XmmZero].

• Use x86-AVX masking and Boolean operations to minimize or eliminate
  data-dependent conditional jump instructions.

• Perform packed data loads and stores using the aligned move instructions (e.g.,
  vmovdqa, vmovap[d|s], etc.).

• Process SIMD arrays using small data blocks to maximize reuse of resident cache data.

• Use the vzeroupper instruction when required to avoid x86-AVX to x86-SSE state
  transition penalties.

• Use the doubleword forms of the gather and scatter instructions instead of the
  quadword forms when possible (e.g., use vgatherdp[d|s] and doubleword indices
  instead of vgatherqp[d|s] and quadword indices). Perform any required gather
  operations well ahead of when the data is needed.
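

The recommendation to process arrays in small data blocks can be sketched in C++ as follows. The function names, the two-step calculation, and the block size parameter are hypothetical and chosen only for illustration; a suitable block size depends on the cache sizes of the target processor and should be determined by benchmarking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Two passes over the whole array: for a large array, the values written
// by the first pass may be evicted from the cache before the second pass
// reads them again.
void TwoPassWhole(std::vector<float>& y, const std::vector<float>& x,
                  float scale, float offset)
{
    y.resize(x.size());
    for (size_t i = 0; i < x.size(); i++)
        y[i] = x[i] * scale;                    // pass 1
    for (size_t i = 0; i < x.size(); i++)
        y[i] = y[i] + offset;                   // pass 2
}

// Same calculation, but both passes are applied to one small block
// before moving on, so the second pass finds its data still in cache.
void TwoPassBlocked(std::vector<float>& y, const std::vector<float>& x,
                    float scale, float offset, size_t block)
{
    assert(block > 0);
    y.resize(x.size());

    for (size_t start = 0; start < x.size(); start += block)
    {
        size_t stop = std::min(start + block, x.size());

        for (size_t i = start; i < stop; i++)
            y[i] = x[i] * scale;                // pass 1 on this block
        for (size_t i = start; i < stop; i++)
            y[i] = y[i] + offset;               // pass 2 on this block
    }
}
```

Both functions compute the same values; the blocked version merely changes the order in which memory is visited.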


The following practices can also be employed to improve the performance of certain algorithms that 
perform SIMD encoding and decoding operations: 


• Use the non-temporal store instructions (e.g., vmovntdq, vmovntp[d|s], etc.) to
  minimize cache pollution.

• Use the data prefetch instructions (e.g., prefetcht0, prefetchnta, etc.) to notify the
  processor of expected-use data items.


Chapter 16 contains a couple of source code examples that illustrate how to use the non-temporal store 
and data prefetch instructions.


Summary


Here are the key learning points for Chapter 15:


• The performance of most assembly language functions can be improved by
  implementing the optimization strategies and techniques outlined in this chapter.

• The recommended optimization techniques must be judiciously applied. It is
  not uncommon to encounter coding situations where a recommended strategy or
  technique is not the best approach.

• To achieve optimal performance for a specific algorithm or function, it may be
  necessary to code multiple versions and compare benchmark timing measurements.

• When developing assembly language code, don't spend an excessive amount of time
  trying to maximize performance. Focus on performance gains that are relatively easy
  to attain (e.g., implementing an algorithm using SIMD instead of scalar arithmetic).

• None of the optimization strategies and techniques presented in this chapter will
  ameliorate an inappropriate or poorly designed algorithm.


CHAPTER 16 


Advanced Programming 


The final chapter of this book reviews several source code examples that demonstrate advanced x86
assembly language programming techniques. The first example explains how to use the cpuid instruction
to detect specific x86 instruction set extensions. This is followed by two examples that illustrate how to
accelerate SIMD processing functions using non-temporal memory stores and data prefetch instructions.
The concluding example elucidates the use of an assembly language calculating function in a multithreaded
application.


CPUID Instruction 


It's been mentioned several times in this book that an application program should never assume that
a specific instruction set extension such as AVX, AVX2, or AVX-512 is available simply by knowing the
processor's microarchitecture, model number, or brand name. An application program should always test
for the presence of an instruction set extension using the cpuid (CPU Identification) instruction. Application
programs can use this instruction to verify that a processor supports one of the previously-mentioned
x86-AVX instruction set extensions. The cpuid instruction can also be used to obtain additional processor
feature information that's useful or needed in both application programs and operating system software.

Listing 16-1 shows the source code for example Ch16_01. This example demonstrates how to use the
cpuid instruction to determine processor support for various instruction set extensions. The source code
for example Ch16_01 focuses on using cpuid to detect architectural features and instruction set extensions
that are allied with this book's content. If you're interested in learning how to use the cpuid instruction to
identify other processor features, you should consult the AMD and Intel reference manuals that are listed in
Appendix A.
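

Before examining the listing, it may help to see how a few feature bits are decoded once the cpuid results are available. The sketch below is not taken from example Ch16_01; the names CpuidOut, Features, and DecodeAvx are hypothetical. It operates on register values supplied by the caller, so it does not execute cpuid itself. The bit positions used are the ones documented for CPUID leaf 01H, CPUID leaf 07H (sub-leaf 0), and the XCR0 register that xgetbv returns.

```cpp
#include <cassert>
#include <cstdint>

// Register values returned by one execution of cpuid (hypothetical type).
struct CpuidOut
{
    uint32_t EAX, EBX, ECX, EDX;
};

struct Features
{
    bool avx;
    bool avx2;
    bool avx512f;
};

constexpr bool Bit(uint32_t v, unsigned n)
{
    return ((v >> n) & 1u) != 0;
}

// leaf1 = CPUID.(EAX=01H), leaf7 = CPUID.(EAX=07H,ECX=0),
// xcr0 = low 32 bits returned by xgetbv with ECX = 0. The xcr0 value
// is used only if the OSXSAVE bit indicates that xgetbv is available.
Features DecodeAvx(const CpuidOut& leaf1, const CpuidOut& leaf7, uint32_t xcr0)
{
    const bool osxsave = Bit(leaf1.ECX, 27);                    // OS has enabled XSAVE
    const bool osAvx = osxsave && ((xcr0 & 0x06) == 0x06);      // XMM and YMM state
    const bool osAvx512 = osAvx && ((xcr0 & 0xE0) == 0xE0);     // opmask and ZMM state

    Features f{};
    f.avx = osAvx && Bit(leaf1.ECX, 28);                        // CPUID.01H:ECX.AVX
    f.avx2 = f.avx && Bit(leaf7.EBX, 5);                        // CPUID.07H:EBX.AVX2
    f.avx512f = osAvx512 && Bit(leaf7.EBX, 16);                 // CPUID.07H:EBX.AVX512F
    return f;
}
```

The important point is that a processor feature bit alone is not sufficient: the operating system must also have enabled the corresponding register state, which is why the XCR0 bits are tested first.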


Listing 16-1. Example Ch16_01


#pragma once
#include <cstdint>
#include <vector>
#include <string>

struct CpuidRegs
{
    uint32_t EAX;
    uint32_t EBX;
    uint32_t ECX;
    uint32_t EDX;
};

class CpuidInfo
{
public:
    class CacheInfo
    {
    public:
        enum class Type
        {
            Unknown, Data, Instruction, Unified
        };

    private:
        uint32_t m_Level = 0;
        Type m_Type = Type::Unknown;
        uint32_t m_Size = 0;

    public:
        uint32_t GetLevel(void) const { return m_Level; }
        uint32_t GetSize(void) const { return m_Size; }
        Type GetType(void) const { return m_Type; }

        // These are defined in CacheInfo.cpp
        CacheInfo(uint32_t level, uint32_t type, uint32_t size);
        std::string GetTypeString(void) const;
    };

private:
    uint32_t m_MaxEax;                              // Max EAX for basic CPUID
    uint32_t m_MaxEaxExt;                           // Max EAX for extended CPUID
    uint64_t m_FeatureFlags;                        // Processor feature flags
    std::vector<CpuidInfo::CacheInfo> m_CacheInfo;  // Processor cache information
    char m_VendorId[13];                            // Processor vendor ID string
    char m_ProcessorBrand[49];                      // Processor brand string
    bool m_OsXsave;                                 // XSAVE is enabled for app use
    bool m_OsAvxState;                              // AVX state is enabled by OS
    bool m_OsAvx512State;                           // AVX-512 state is enabled by OS

    void Init(void);
    void InitProcessorBrand(void);
    void LoadInfo0(void);
    void LoadInfo1(void);
    void LoadInfo2(void);
    void LoadInfo3(void);
    void LoadInfo4(void);
    void LoadInfo5(void);

public:
    enum class FF : uint64_t
    {
        FXSR                = (uint64_t)1 << 0,
        MMX                 = (uint64_t)1 << 1,
        MOVBE               = (uint64_t)1 << 2,
        SSE                 = (uint64_t)1 << 3,
        SSE2                = (uint64_t)1 << 4,
        SSE3                = (uint64_t)1 << 5,
        SSSE3               = (uint64_t)1 << 6,
        SSE4_1              = (uint64_t)1 << 7,
        SSE4_2              = (uint64_t)1 << 8,
        PCLMULQDQ           = (uint64_t)1 << 9,
        POPCNT              = (uint64_t)1 << 10,
        PREFETCHW           = (uint64_t)1 << 11,
        PREFETCHWT1         = (uint64_t)1 << 12,
        RDRAND              = (uint64_t)1 << 13,
        RDSEED              = (uint64_t)1 << 14,
        ERMSB               = (uint64_t)1 << 15,
        AVX                 = (uint64_t)1 << 16,
        AVX2                = (uint64_t)1 << 17,
        F16C                = (uint64_t)1 << 18,
        FMA                 = (uint64_t)1 << 19,
        BMI1                = (uint64_t)1 << 20,
        BMI2                = (uint64_t)1 << 21,
        LZCNT               = (uint64_t)1 << 22,
        ADX                 = (uint64_t)1 << 23,
        AVX512F             = (uint64_t)1 << 24,
        AVX512ER            = (uint64_t)1 << 25,
        AVX512PF            = (uint64_t)1 << 26,
        AVX512DQ            = (uint64_t)1 << 27,
        AVX512CD            = (uint64_t)1 << 28,
        AVX512BW            = (uint64_t)1 << 29,
        AVX512VL            = (uint64_t)1 << 30,
        AVX512_IFMA         = (uint64_t)1 << 31,
        AVX512_VBMI         = (uint64_t)1 << 32,
        AVX512_4FMAPS       = (uint64_t)1 << 33,
        AVX512_4VNNIW       = (uint64_t)1 << 34,
        AVX512_VPOPCNTDQ    = (uint64_t)1 << 35,
        AVX512_VNNI         = (uint64_t)1 << 36,
        AVX512_VBMI2        = (uint64_t)1 << 37,
        AVX512_BITALG       = (uint64_t)1 << 38,
        CLWB                = (uint64_t)1 << 39,
    };

    CpuidInfo(void) { Init(); };
    ~CpuidInfo() {};

    const std::vector<CpuidInfo::CacheInfo>& GetCacheInfo(void) const
    {
        return m_CacheInfo;
    }

    bool GetFF(FF flag) const
    {
        return ((m_FeatureFlags & (uint64_t)flag) != 0) ? true : false;
    }

    std::string GetProcessorBrand(void) const { return std::string(m_ProcessorBrand); }
    std::string GetVendorId(void) const { return std::string(m_VendorId); }

    void LoadInfo(void);
};

// Cpuinfo_.asm
extern "C" void Xgetbv_(uint32_t r_ecx, uint32_t* r_eax, uint32_t* r_edx);
extern "C" uint32_t Cpuid_(uint32_t r_eax, uint32_t r_ecx, CpuidRegs* r_out);


; The following structures must agree with the CpuidRegs structure
; that's defined in CpuidInfo.h

CpuidRegs   struct
RegEAX      dword ?
RegEBX      dword ?
RegECX      dword ?
RegEDX      dword ?
CpuidRegs   ends

; extern "C" uint32_t Cpuid_(uint32_t r_eax, uint32_t r_ecx, CpuidRegs* r_out);
;
; Returns:      eax == 0     Unsupported CPUID leaf
;               eax != 0     Supported CPUID leaf
;
;               Note: the return code is valid only if r_eax <= MaxEAX.

        .code
Cpuid_  proc frame
        push rbx
        .pushreg rbx
        .endprolog

; Load eax and ecx
        mov eax,ecx
        mov ecx,edx

; Get cpuid info & save results
        cpuid
        mov dword ptr [r8+CpuidRegs.RegEAX],eax
        mov dword ptr [r8+CpuidRegs.RegEBX],ebx
        mov dword ptr [r8+CpuidRegs.RegECX],ecx
        mov dword ptr [r8+CpuidRegs.RegEDX],edx

; Test for unsupported CPUID leaf
        or eax,ebx
        or ecx,edx
        or eax,ecx                          ;eax = return code

        pop rbx
        ret
Cpuid_  endp

; extern "C" void Xgetbv_(uint32_t r_ecx, uint32_t* r_eax, uint32_t* r_edx);

Xgetbv_ proc
        mov r9,rdx                          ;r9 = r_eax ptr
        xgetbv
        mov dword ptr [r9],eax              ;save low dword result
        mov dword ptr [r8],edx              ;save high dword result
        ret
Xgetbv_ endp
        end
//------------------------------------------------
//               Ch16_01.cpp
//------------------------------------------------

#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <string>
#include "CpuidInfo.h"

using namespace std;

static void DisplayCacheInfo(const CpuidInfo& ci);
static void DisplayFeatureFlags(const CpuidInfo& ci);

int main()
{
    CpuidInfo ci;

    ci.LoadInfo();

    cout << ci.GetVendorId() << '\n';
    cout << ci.GetProcessorBrand() << '\n';

    DisplayCacheInfo(ci);
    DisplayFeatureFlags(ci);
    return 0;
}

static void DisplayCacheInfo(const CpuidInfo& ci)
{
    const vector<CpuidInfo::CacheInfo>& cache_info = ci.GetCacheInfo();

    for (const CpuidInfo::CacheInfo& x : cache_info)
    {
        uint32_t cache_size = x.GetSize();
        string cache_size_str;

        if (cache_size < 1024 * 1024)
        {
            cache_size /= 1024;
            cache_size_str = "KB";
        }
        else
        {
            cache_size /= 1024 * 1024;
            cache_size_str = "MB";
        }

        cout << "Cache L" << x.GetLevel() << ": ";
        cout << cache_size << cache_size_str << ' ';
        cout << x.GetTypeString() << '\n';
    }
}

static void DisplayFeatureFlags(const CpuidInfo& ci)
{
    const char nl = '\n';

    cout << "----- CPUID Feature Flags -----" << nl;
    cout << "ADX:         " << ci.GetFF(CpuidInfo::FF::ADX) << nl;
    cout << "AVX:         " << ci.GetFF(CpuidInfo::FF::AVX) << nl;
    cout << "AVX2:        " << ci.GetFF(CpuidInfo::FF::AVX2) << nl;
    cout << "AVX512F:     " << ci.GetFF(CpuidInfo::FF::AVX512F) << nl;
    cout << "AVX512BW:    " << ci.GetFF(CpuidInfo::FF::AVX512BW) << nl;
    cout << "AVX512CD:    " << ci.GetFF(CpuidInfo::FF::AVX512CD) << nl;
    cout << "AVX512DQ:    " << ci.GetFF(CpuidInfo::FF::AVX512DQ) << nl;
    cout << "AVX512ER:    " << ci.GetFF(CpuidInfo::FF::AVX512ER) << nl;
    cout << "AVX512PF:    " << ci.GetFF(CpuidInfo::FF::AVX512PF) << nl;
    cout << "AVX512VL:    " << ci.GetFF(CpuidInfo::FF::AVX512VL) << nl;
    cout << "AVX512_IFMA: " << ci.GetFF(CpuidInfo::FF::AVX512_IFMA) << nl;
    cout << "AVX512_VBMI: " << ci.GetFF(CpuidInfo::FF::AVX512_VBMI) << nl;
    cout << "BMI1:        " << ci.GetFF(CpuidInfo::FF::BMI1) << nl;
    cout << "BMI2:        " << ci.GetFF(CpuidInfo::FF::BMI2) << nl;
    cout << "F16C:        " << ci.GetFF(CpuidInfo::FF::F16C) << nl;
    cout << "FMA:         " << ci.GetFF(CpuidInfo::FF::FMA) << nl;
    cout << "LZCNT:       " << ci.GetFF(CpuidInfo::FF::LZCNT) << nl;
    cout << "POPCNT:      " << ci.GetFF(CpuidInfo::FF::POPCNT) << nl;
}

//------------------------------------------------
//               CpuidInfo.cpp
//------------------------------------------------

#include "stdafx.h"
#include <string>
#include <cstring>
#include <vector>
#include "CpuidInfo.h"

using namespace std;

void CpuidInfo::LoadInfo(void)
{
    // Note: LoadInfo0 must be called first
    LoadInfo0();
    LoadInfo1();
    LoadInfo2();
    LoadInfo3();
    LoadInfo4();
    LoadInfo5();
}

void CpuidInfo::LoadInfo0(void)
{
    CpuidRegs r1;

    // Perform required initializations
    Init();

    // Get MaxEax and VendorID (leaf 0 returns the ID in EBX, EDX, ECX order)
    Cpuid_(0, 0, &r1);
    m_MaxEax = r1.EAX;
    *(uint32_t *)(m_VendorId + 0) = r1.EBX;
    *(uint32_t *)(m_VendorId + 4) = r1.EDX;
    *(uint32_t *)(m_VendorId + 8) = r1.ECX;
    m_VendorId[sizeof(m_VendorId) - 1] = '\0';

    // Get MaxEaxExt
    Cpuid_(0x80000000, 0, &r1);
    m_MaxEaxExt = r1.EAX;

    // Initialize processor brand string
    InitProcessorBrand();
}


void CpuidInfo::LoadInfo1(void)
{
    CpuidRegs r;

    if (m_MaxEax < 1)
        return;

    Cpuid_(1, 0, &r);

    //
    // Decode r.ECX flags
    //

    // CPUID.(EAX=01H, ECX=00H):ECX.SSE3[bit 0]
    if (r.ECX & (0x1 << 0))
        m_FeatureFlags |= (uint64_t)FF::SSE3;

    // CPUID.(EAX=01H, ECX=00H):ECX.PCLMULQDQ[bit 1]
    if (r.ECX & (0x1 << 1))
        m_FeatureFlags |= (uint64_t)FF::PCLMULQDQ;

    // CPUID.(EAX=01H, ECX=00H):ECX.SSSE3[bit 9]
    if (r.ECX & (0x1 << 9))
        m_FeatureFlags |= (uint64_t)FF::SSSE3;

    // CPUID.(EAX=01H, ECX=00H):ECX.SSE4.1[bit 19]
    if (r.ECX & (0x1 << 19))
        m_FeatureFlags |= (uint64_t)FF::SSE4_1;

    // CPUID.(EAX=01H, ECX=00H):ECX.SSE4.2[bit 20]
    if (r.ECX & (0x1 << 20))
        m_FeatureFlags |= (uint64_t)FF::SSE4_2;

    // CPUID.(EAX=01H, ECX=00H):ECX.MOVBE[bit 22]
    if (r.ECX & (0x1 << 22))
        m_FeatureFlags |= (uint64_t)FF::MOVBE;

    // CPUID.(EAX=01H, ECX=00H):ECX.POPCNT[bit 23]
    if (r.ECX & (0x1 << 23))
        m_FeatureFlags |= (uint64_t)FF::POPCNT;

    // CPUID.(EAX=01H, ECX=00H):ECX.RDRAND[bit 30]
    if (r.ECX & (0x1 << 30))
        m_FeatureFlags |= (uint64_t)FF::RDRAND;

    //
    // Decode r.EDX flags
    //

    // CPUID.(EAX=01H, ECX=00H):EDX.MMX[bit 23]
    if (r.EDX & (0x1 << 23))
        m_FeatureFlags |= (uint64_t)FF::MMX;

    // CPUID.(EAX=01H, ECX=00H):EDX.FXSR[bit 24]
    if (r.EDX & (0x1 << 24))
        m_FeatureFlags |= (uint64_t)FF::FXSR;

    // CPUID.(EAX=01H, ECX=00H):EDX.SSE[bit 25]
    if (r.EDX & (0x1 << 25))
        m_FeatureFlags |= (uint64_t)FF::SSE;

    // CPUID.(EAX=01H, ECX=00H):EDX.SSE2[bit 26]
    if (r.EDX & (0x1 << 26))
        m_FeatureFlags |= (uint64_t)FF::SSE2;
}


void CpuidInfo::LoadInfo2(void)
{
    CpuidRegs r;

    if (m_MaxEax < 7)
        return;

    Cpuid_(7, 0, &r);

    // CPUID.(EAX=07H, ECX=00H):ECX.PREFETCHWT1[bit 0]
    if (r.ECX & (0x1 << 0))
        m_FeatureFlags |= (uint64_t)FF::PREFETCHWT1;

    // CPUID.(EAX=07H, ECX=00H):EBX.BMI1[bit 3]
    if (r.EBX & (0x1 << 3))
        m_FeatureFlags |= (uint64_t)FF::BMI1;

    // CPUID.(EAX=07H, ECX=00H):EBX.BMI2[bit 8]
    if (r.EBX & (0x1 << 8))
        m_FeatureFlags |= (uint64_t)FF::BMI2;

    // CPUID.(EAX=07H, ECX=00H):EBX.ERMSB[bit 9]
    // ERMSB = Enhanced REP MOVSB/STOSB
    if (r.EBX & (0x1 << 9))
        m_FeatureFlags |= (uint64_t)FF::ERMSB;

    // CPUID.(EAX=07H, ECX=00H):EBX.RDSEED[bit 18]
    if (r.EBX & (0x1 << 18))
        m_FeatureFlags |= (uint64_t)FF::RDSEED;

    // CPUID.(EAX=07H, ECX=00H):EBX.ADX[bit 19]
    if (r.EBX & (0x1 << 19))
        m_FeatureFlags |= (uint64_t)FF::ADX;

    // CPUID.(EAX=07H, ECX=00H):EBX.CLWB[bit 24]
    if (r.EBX & (0x1 << 24))
        m_FeatureFlags |= (uint64_t)FF::CLWB;
}

void CpuidInfo::LoadInfo3(void)
{
    CpuidRegs r;

    if (m_MaxEaxExt < 0x80000001)
        return;

    Cpuid_(0x80000001, 0, &r);

    // CPUID.(EAX=80000001H, ECX=00H):ECX.LZCNT[bit 5]
    if (r.ECX & (0x1 << 5))
        m_FeatureFlags |= (uint64_t)FF::LZCNT;

    // CPUID.(EAX=80000001H, ECX=00H):ECX.PREFETCHW[bit 8]
    if (r.ECX & (0x1 << 8))
        m_FeatureFlags |= (uint64_t)FF::PREFETCHW;
}

void CpuidInfo::LoadInfo4(void)
{
    CpuidRegs r_eax01h;
    CpuidRegs r_eax07h;

    if (m_MaxEax < 7)
        return;

    Cpuid_(1, 0, &r_eax01h);
    Cpuid_(7, 0, &r_eax07h);

    // Test CPUID.(EAX=01H, ECX=00H):ECX.OSXSAVE[bit 27] to verify use of XGETBV
    m_OsXsave = (r_eax01h.ECX & (0x1 << 27)) ? true : false;

    if (m_OsXsave)
    {
        // Use XGETBV to obtain following information
        // AVX state is enabled by OS if (XCR0[2:1] == '11b') is true
        // AVX512 state is enabled by OS if (XCR0[7:5] == '111b') is true

        uint32_t xgetbv_eax, xgetbv_edx;

        Xgetbv_(0, &xgetbv_eax, &xgetbv_edx);
        m_OsAvxState = (((xgetbv_eax >> 1) & 0x03) == 0x03) ? true : false;

        if (m_OsAvxState)
        {
            // CPUID.(EAX=01H, ECX=00H):ECX.AVX[bit 28]
            if (r_eax01h.ECX & (0x1 << 28))
            {
                m_FeatureFlags |= (uint64_t)FF::AVX;

                // CPUID.(EAX=01H, ECX=00H):ECX.FMA[bit 12]
                if (r_eax01h.ECX & (0x1 << 12))
                    m_FeatureFlags |= (uint64_t)FF::FMA;

                // CPUID.(EAX=01H, ECX=00H):ECX.F16C[bit 29]
                if (r_eax01h.ECX & (0x1 << 29))
                    m_FeatureFlags |= (uint64_t)FF::F16C;

                // CPUID.(EAX=07H, ECX=00H):EBX.AVX2[bit 5]
                if (r_eax07h.EBX & (0x1 << 5))
                    m_FeatureFlags |= (uint64_t)FF::AVX2;

                m_OsAvx512State = (((xgetbv_eax >> 5) & 0x07) == 0x07) ? true : false;

                if (m_OsAvx512State)
                {
                    // CPUID.(EAX=07H, ECX=00H):EBX.AVX512F[bit 16]
                    if (r_eax07h.EBX & (0x1 << 16))
                    {
                        m_FeatureFlags |= (uint64_t)FF::AVX512F;

                        //
                        // Decode EBX flags
                        //

                        // CPUID.(EAX=07H, ECX=00H):EBX.AVX512DQ[bit 17]
                        if (r_eax07h.EBX & (0x1 << 17))
                            m_FeatureFlags |= (uint64_t)FF::AVX512DQ;

                        // CPUID.(EAX=07H, ECX=00H):EBX.AVX512_IFMA[bit 21]
                        if (r_eax07h.EBX & (0x1 << 21))
                            m_FeatureFlags |= (uint64_t)FF::AVX512_IFMA;

                        // CPUID.(EAX=07H, ECX=00H):EBX.AVX512PF[bit 26]
                        if (r_eax07h.EBX & (0x1 << 26))
                            m_FeatureFlags |= (uint64_t)FF::AVX512PF;

                        // CPUID.(EAX=07H, ECX=00H):EBX.AVX512ER[bit 27]
                        if (r_eax07h.EBX & (0x1 << 27))
                            m_FeatureFlags |= (uint64_t)FF::AVX512ER;

                        // CPUID.(EAX=07H, ECX=00H):EBX.AVX512CD[bit 28]
                        if (r_eax07h.EBX & (0x1 << 28))
                            m_FeatureFlags |= (uint64_t)FF::AVX512CD;

                        // CPUID.(EAX=07H, ECX=00H):EBX.AVX512BW[bit 30]
                        if (r_eax07h.EBX & (0x1 << 30))
                            m_FeatureFlags |= (uint64_t)FF::AVX512BW;

                        // CPUID.(EAX=07H, ECX=00H):EBX.AVX512VL[bit 31]
                        // (unsigned literal avoids shifting into the sign bit)
                        if (r_eax07h.EBX & (0x1u << 31))
                            m_FeatureFlags |= (uint64_t)FF::AVX512VL;

                        //
                        // Decode ECX flags
                        //

                        // CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VBMI[bit 1]
                        if (r_eax07h.ECX & (0x1 << 1))
                            m_FeatureFlags |= (uint64_t)FF::AVX512_VBMI;

                        // CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VBMI2[bit 6]
                        if (r_eax07h.ECX & (0x1 << 6))
                            m_FeatureFlags |= (uint64_t)FF::AVX512_VBMI2;

                        // CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VNNI[bit 11]
                        if (r_eax07h.ECX & (0x1 << 11))
                            m_FeatureFlags |= (uint64_t)FF::AVX512_VNNI;

                        // CPUID.(EAX=07H, ECX=00H):ECX.AVX512_BITALG[bit 12]
                        if (r_eax07h.ECX & (0x1 << 12))
                            m_FeatureFlags |= (uint64_t)FF::AVX512_BITALG;

                        // CPUID.(EAX=07H, ECX=00H):ECX.AVX512_VPOPCNTDQ[bit 14]
                        if (r_eax07h.ECX & (0x1 << 14))
                            m_FeatureFlags |= (uint64_t)FF::AVX512_VPOPCNTDQ;

                        //
                        // Decode EDX flags
                        //

                        // CPUID.(EAX=07H, ECX=00H):EDX.AVX512_4VNNIW[bit 2]
                        if (r_eax07h.EDX & (0x1 << 2))
                            m_FeatureFlags |= (uint64_t)FF::AVX512_4VNNIW;

                        // CPUID.(EAX=07H, ECX=00H):EDX.AVX512_4FMAPS[bit 3]
                        if (r_eax07h.EDX & (0x1 << 3))
                            m_FeatureFlags |= (uint64_t)FF::AVX512_4FMAPS;
                    }
                }
            }
        }
    }
}
void CpuidInfo::LoadInfo5(void)
{
    if (m_MaxEax < 4)
        return;

    bool done = false;
    uint32_t index = 0;

    while (!done)
    {
        CpuidRegs r;

        Cpuid_(4, index, &r);

        uint32_t cache_type = r.EAX & 0x1f;
        uint32_t cache_level = ((r.EAX >> 5) & 0x3);

        if (cache_type == 0)
            done = true;
        else
        {
            uint32_t ways = ((r.EBX >> 22) & 0x3ff) + 1;
            uint32_t partitions = ((r.EBX >> 12) & 0x3ff) + 1;
            uint32_t line_size = (r.EBX & 0xfff) + 1;
            uint32_t sets = r.ECX + 1;
            uint32_t cache_size = ways * partitions * line_size * sets;

            CacheInfo ci(cache_level, cache_type, cache_size);
            m_CacheInfo.push_back(ci);
            index++;
        }
    }
}


void CpuidInfo::Init(void)
{
    m_MaxEax = 0;
    m_MaxEaxExt = 0;
    m_FeatureFlags = 0;
    m_OsXsave = false;
    m_OsAvxState = false;
    m_OsAvx512State = false;
    m_VendorId[0] = '\0';
    m_ProcessorBrand[0] = '\0';
    m_CacheInfo.clear();
}
void CpuidInfo::InitProcessorBrand(void)
{
    if (m_MaxEaxExt >= 0x80000004)
    {
        CpuidRegs r2, r3, r4;
        char* p = m_ProcessorBrand;

        Cpuid_(0x80000002, 0, &r2);
        Cpuid_(0x80000003, 0, &r3);
        Cpuid_(0x80000004, 0, &r4);

        *(uint32_t *)(p + 0) = r2.EAX;
        *(uint32_t *)(p + 4) = r2.EBX;
        *(uint32_t *)(p + 8) = r2.ECX;
        *(uint32_t *)(p + 12) = r2.EDX;
        *(uint32_t *)(p + 16) = r3.EAX;
        *(uint32_t *)(p + 20) = r3.EBX;
        *(uint32_t *)(p + 24) = r3.ECX;
        *(uint32_t *)(p + 28) = r3.EDX;
        *(uint32_t *)(p + 32) = r4.EAX;
        *(uint32_t *)(p + 36) = r4.EBX;
        *(uint32_t *)(p + 40) = r4.ECX;
        *(uint32_t *)(p + 44) = r4.EDX;

        m_ProcessorBrand[sizeof(m_ProcessorBrand) - 1] = '\0';
    }
    else
        strcpy_s(m_ProcessorBrand, "Unknown");
}


Before examining the source code, it will be helpful to have a basic understanding of the cpuid 
instruction and how it works. Prior to using cpuid, a function must load register EAX with a "leaf" value that 
specifies what information the cpuid instruction should return. A second or "sub-leaf" value may also be 
required in register ECX. The cpuid instruction returns its results in registers EAX, EBX, ECX, and EDX. The 
calling function must then decode the values in these registers to ascertain processor support for specific 
features. As you will soon see, it is often necessary for a program to employ the cpuid instruction multiple 
times. Most application programs typically use cpuid during initialization and save the results for later use. 
The reason for this is that cpuid is a serializing instruction, which means that it forces the processor to finish 
executing all previously fetched instructions and perform any pending memory writes before fetching the 
next instruction. In other words, the cpuid instruction takes a long time to complete its execution. 

Listing 16-1 begins with the header file CpuidInfo.h. Near the top of this file is a structure named 
CpuidRegs, which is used to save the results returned by cpuid. Following CpuidRegs is a C++ class named 
CpuidInfo. This class contains the code and data that's associated with cpuid instruction use. The public 
portion of CpuidInfo includes a nested class named CacheInfo. This class is employed to report information 
about a processor's memory caches. Class CpuidInfo also includes an enum named FF. An application 
program can use this enum as an argument value with the member function CpuidInfo::GetFF to determine 
if the host processor supports a specific instruction set. You'll see how this works later in this section. Toward 
the bottom of header file CpuidInfo.h are two declaration statements for the assembly language functions 
Cpuid_ and Xgetbv_. These functions execute the cpuid and xgetbv (Get Value of Extended Control 
Register) instructions, respectively. 
Following CpuidInfo.h in Listing 16-1 is the source code file CpuidInfo_.asm. This file contains 
the assembly language functions Cpuid_ and Xgetbv_, which are simple wrapper functions for the x86 
instructions cpuid and xgetbv. The function Cpuid_ begins its execution by saving register RBX on the stack. 
It then loads argument values r_eax and r_ecx into registers EAX and ECX. The actual cpuid instruction 
follows the loading of registers EAX and ECX. Following the execution of cpuid, the results in registers EAX, 
EBX, ECX, and EDX are saved to the specified CpuidRegs structure. The assembly language function Xgetbv_ 
executes the xgetbv instruction. This instruction loads the contents of the extended processor control 
register that's specified by ECX into register pair EDX:EAX. The xgetbv instruction allows an application 
program to determine if the host operating system supports AVX, AVX2, or AVX-512, as explained later in this 
section. 
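
A C++ program can also reach cpuid through a compiler intrinsic instead of an assembly language wrapper. The following is a minimal sketch of that alternative and is not part of the book's source code. The names CpuidRegsSketch and CpuidQuery are hypothetical; the sketch assumes either Visual C++ (__cpuidex from <intrin.h>) or GCC/Clang (__cpuid_count from <cpuid.h>), and it simply returns zeros when compiled for a non-x86 target.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define SKETCH_CPUID_MSVC 1
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_CPUID_GNU 1
#endif

// Hypothetical stand-in for the CpuidRegs structure (same field order).
struct CpuidRegsSketch
{
    uint32_t EAX;
    uint32_t EBX;
    uint32_t ECX;
    uint32_t EDX;
};

// Executes cpuid with the specified leaf (EAX) and sub-leaf (ECX) values.
// Assumption: on unsupported targets all four fields are returned as zero.
inline CpuidRegsSketch CpuidQuery(uint32_t leaf, uint32_t sub_leaf)
{
    CpuidRegsSketch r = { 0, 0, 0, 0 };

#if defined(SKETCH_CPUID_MSVC)
    int info[4];
    __cpuidex(info, (int)leaf, (int)sub_leaf);
    r.EAX = (uint32_t)info[0];
    r.EBX = (uint32_t)info[1];
    r.ECX = (uint32_t)info[2];
    r.EDX = (uint32_t)info[3];
#elif defined(SKETCH_CPUID_GNU)
    __cpuid_count(leaf, sub_leaf, r.EAX, r.EBX, r.ECX, r.EDX);
#else
    (void)leaf;
    (void)sub_leaf;
#endif

    return r;
}
```

A call such as CpuidQuery(0, 0) would return the maximum basic leaf value in EAX, which mirrors the first Cpuid_ call that CpuidInfo makes.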

The next file in Listing 16-1 is Ch16_01.cpp. The function main contains code that illustrates how to use the 
C++ class CpuidInfo. The statement ci.LoadInfo() invokes the member function CpuidInfo::LoadInfo, 
which generates multiple executions of cpuid to obtain information about the processor. Note that 
CpuidInfo::LoadInfo is only called once. The function DisplayCacheInfo streams information 
about the processor's memory caches to cout. This function invokes CpuidInfo::GetCacheInfo to 
report cache information that was obtained during execution of CpuidInfo::LoadInfo. The function 
DisplayFeatureFlags shows information about some of the instruction set extensions that the processor 
supports. Each cout statement in this function uses CpuidInfo::GetFF with a different CpuidInfo::FF value. 
The member function CpuidInfo::GetFF returns a single bool value that indicates whether the processor 
supports the instruction set extension that's specified by its argument value. Like the cache data, the processor 
instruction set extension information was obtained and saved during the call to CpuidInfo::LoadInfo. 
Note that CpuidInfo is structured to allow an application program to make multiple CpuidInfo::GetFF calls 
without triggering additional executions of cpuid. 
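
The save-once, query-many arrangement described above rests on a simple bit-flag technique. The following minimal sketch shows the idea in isolation. The names Feature and FeatureSet and the bit positions are hypothetical and are not taken from CpuidInfo.h; they only illustrate how a one-bit-per-feature enum can be tested with a bitwise AND.

```cpp
#include <cstdint>

// Hypothetical feature flags: each enumerator occupies exactly one bit.
enum class Feature : uint64_t
{
    SSE2 = (uint64_t)1 << 0,
    AVX  = (uint64_t)1 << 1,
    AVX2 = (uint64_t)1 << 2,
    FMA  = (uint64_t)1 << 3
};

// Holds the flags that were decoded once during initialization.
class FeatureSet
{
    uint64_t m_Flags;

public:
    constexpr explicit FeatureSet(uint64_t flags = 0) : m_Flags(flags) {}

    // Returns a copy of this set with feature f marked as supported.
    constexpr FeatureSet With(Feature f) const
    {
        return FeatureSet(m_Flags | (uint64_t)f);
    }

    // Cheap query: no cpuid execution, just a mask test on the saved flags.
    constexpr bool Get(Feature f) const
    {
        return (m_Flags & (uint64_t)f) != 0;
    }
};
```

Because Get only masks a saved 64-bit value, it can be called as often as needed without incurring the cost of a serializing instruction.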

Following the file Ch16_01.cpp in Listing 16-1 is the source code for CpuidInfo.cpp, which contains the 
non-trivial member functions for class CpuidInfo. The member function CpuidInfo::LoadInfo that was 
discussed earlier invokes six private member functions that perform a multitude of cpuid queries. The first 
of these functions, CpuidInfo::LoadInfo0, begins its execution by calling CpuidInfo::Init to carry out 
the requisite initializations. It then invokes the assembly language function Cpuid_ to obtain the maximum 
cpuid leaf value that’s supported by the processor and the processor vendor ID string. Another call to Cpuid_ 
is then used to obtain the maximum leaf value for extended cpuid information. This is followed by a call to 
CpuidInfo::InitProcessorBrand, which uses several Cpuid_ calls to query and save the processor brand 
string. The source code for this function is located toward the end of the file CpuidInfo.cpp. 
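
The vendor ID string is returned by cpuid leaf 0 as twelve ASCII characters packed into registers EBX, EDX, and ECX (in that order), with the first character of each group in the low-order byte. The following minimal sketch decodes the string from register values that are passed in as plain integers, so it can be studied without executing cpuid. The function names VendorIdChar and VendorIdFromRegs are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Returns character i (0..11) of the vendor ID string.
// Characters 0-3 come from EBX, 4-7 from EDX, and 8-11 from ECX.
constexpr char VendorIdChar(uint32_t ebx, uint32_t edx, uint32_t ecx, unsigned i)
{
    return static_cast<char>(
        ((i < 4 ? ebx : (i < 8 ? edx : ecx)) >> (8 * (i % 4))) & 0xff);
}

// Builds the complete 12-character vendor ID string.
inline std::string VendorIdFromRegs(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
    std::string s;

    for (unsigned i = 0; i < 12; i++)
        s += VendorIdChar(ebx, edx, ecx, i);

    return s;
}
```

For example, the register values EBX = 0x756e6547, EDX = 0x49656e69, and ECX = 0x6c65746e decode to the string GenuineIntel. Extracting the bytes with shifts, rather than through pointer casts, keeps the sketch independent of the host's byte order.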

The member functions CpuidInfo::LoadInfo1, CpuidInfo::LoadInfo2, and CpuidInfo::LoadInfo3 
also exploit Cpuid_ to ascertain processor support for a variety of instruction set extensions. The code that’s 
contained in these member functions is mostly brute-force decoding of the various cpuid results. The AMD 
and Intel programming reference manuals contain additional information about the cpuid feature flag bits 
that are used to indicate processor support for a specific instruction set extension. The private member 
function CpuidInfo::LoadInfo4 contains the code that checks for AVX, AVX2, and AVX-512. This member 
function warrants closer examination. 

An application program can use the computational resources of x86-AVX only if it’s supported by 
both the processor and its host operating system. The Xgetbv_ function can be employed to determine 
host operating system support. Before using Xgetbv_, the cpuid flag OSXSAVE must be tested to ensure that 
it’s safe for an application program to use the xgetbv instruction. If OSXSAVE is set to true, the function 
CpuidInfo::LoadInfo4 invokes Xgetbv_ to obtain information regarding OS support for x86-AVX state 
information (i.e., whether the OS properly preserves the XMM, YMM, and ZMM registers during a task 
switch). When using the Xgetbv_ function, the processor will generate an exception if the extended control 
register number is invalid or if the processor’s OSXSAVE flag is set to false. This explains why the software 
flag m_OsXsave is checked prior to calling Xgetbv_. If the host operating system supports x86-AVX state 
information, the function CpuidInfo::LoadInfo4 proceeds to decode the cpuid feature flags related 
to AVX and AVX2. Note that the feature flags FMA and F16C are also tested here. The remaining code 
in CpuidInfo::LoadInfo4 decodes the cpuid feature flags that signify support for the various AVX-512 
instruction set extensions. 
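
The operating system checks performed by CpuidInfo::LoadInfo4 reduce to two mask tests on the low 32 bits of extended control register XCR0. The following minimal sketch isolates those tests. The function names are hypothetical, and the value passed in is assumed to be the EAX result of an xgetbv instruction that was executed with ECX = 0.

```cpp
#include <cstdint>

// AVX state is enabled by the OS if XCR0[2:1] == 11b
// (bit 1 = XMM state, bit 2 = YMM state).
constexpr bool OsAvxStateEnabled(uint32_t xcr0_low)
{
    return ((xcr0_low >> 1) & 0x03) == 0x03;
}

// AVX-512 state is enabled by the OS if XCR0[7:5] == 111b
// (opmask, upper ZMM, and high ZMM state). As in LoadInfo4, this
// is only meaningful when the AVX state test also succeeds.
constexpr bool OsAvx512StateEnabled(uint32_t xcr0_low)
{
    return OsAvxStateEnabled(xcr0_low) && (((xcr0_low >> 5) & 0x07) == 0x07);
}
```

An XCR0 value of 0x07, for instance, satisfies the AVX test but not the AVX-512 test, while 0xE7 satisfies both.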
The final private member function that's called by CpuidInfo::LoadInfo is named CpuidInfo::LoadInfo5. 
This member function uses cpuid and the class CpuidInfo::CacheInfo to save type and size information about 
the processor's memory caches. The ancillary code for class CpuidInfo::CacheInfo is not shown in Listing 16-1 
but is included with the chapter download package. Here is the output for source code example Ch16_01: 


GenuineIntel 
Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz 
Cache L1: 32KB Data 
Cache L1: 32KB Instruction 
Cache L2: 1MB Unified 
Cache L3: 13MB Unified 
----- CPUID Feature Flags ----- 
ADX:         1 
AVX512F: 
AVX512BW: 
AVX512CD: 
AVX512DQ: 
AVX512ER: 
AVX512PF: 
AVX512VL: 
AVX512_IFMA: 
AVX512_VBMI: 
BMI1: 
BMI2: 
F16C: 
FMA: 
LZCNT: 
POPCNT: 
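
The cache sizes shown in the output are derived from the cpuid leaf 4 results using the same arithmetic as CpuidInfo::LoadInfo5. The following minimal sketch isolates that calculation so that it can be checked by hand. The function name CacheSizeBytes is hypothetical, and the register values used in the accompanying remarks are constructed for illustration rather than captured from a specific processor.

```cpp
#include <cstdint>

// Computes a cache's size in bytes from the EBX and ECX values returned by
// cpuid leaf 4. Each field stores its value minus one:
//   EBX[31:22] = ways of associativity
//   EBX[21:12] = physical line partitions
//   EBX[11:0]  = line size in bytes
//   ECX        = number of sets
constexpr uint32_t CacheSizeBytes(uint32_t ebx, uint32_t ecx)
{
    return (((ebx >> 22) & 0x3ff) + 1)      // ways
         * (((ebx >> 12) & 0x3ff) + 1)      // partitions
         * ((ebx & 0xfff) + 1)              // line size
         * (ecx + 1);                       // sets
}
```

With EBX = 0x01C0003F (8 ways, 1 partition, 64-byte lines) and ECX = 63 (64 sets), the function yields 8 * 1 * 64 * 64 = 32768 bytes, which corresponds to a 32KB cache.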


Table 16-1 shows a summary of cpuid information for several Intel processors. Before moving on to the 
next source code example, it should be noted that when using the cpuid instruction to determine processor 
support for the various AVX-512 instruction set extensions, it is often necessary for an application program 
to test multiple feature flags. For example, an application program must verify that the AVX512F, AVX512DQ, 
and AVX512VL feature flags are all set before using any AVX512DQ instructions with 256-bit or 128-bit wide 
operands. The Intel 64 and IA-32 Architectures Software Developer’s Manual (Volume 1) contains additional 
information regarding cpuid instruction use and feature flag testing. 
Table 16-1. Summary of Information from cpuid Instruction for Select Intel Processors 

CPUID Feature                   i3-2310m   i7-4790s   i9-7900x   i7-8700k 
L1 Data (KB, per core)          32         32         32         32 
L1 Instruction (KB, per core)   32         32         32         32 
L2 Unified (KB, per core)       256        256        1024       256 
L3 Unified (MB) 
ADX 
AVX 
AVX2 
AVX512F 
AVX512BW 
AVX512CD 
AVX512DQ 
AVX512ER 
AVX512PF 
AVX512VL 
AVX512_IFMA 
AVX512_VBMI 
BMI1 
BMI2 
F16C 
FMA 
LZCNT 
POPCNT 
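
The multiple-flag requirement mentioned above for AVX512DQ instructions with 256-bit or 128-bit wide operands can be expressed as a single mask comparison. The following minimal sketch works directly on the EBX value returned by cpuid leaf 7 (sub-leaf 0). The constant and function names are hypothetical; the bit positions are the ones used in CpuidInfo::LoadInfo4.

```cpp
#include <cstdint>

// Feature bits in CPUID.(EAX=07H, ECX=00H):EBX
constexpr uint32_t kBitAvx512F  = 1u << 16;
constexpr uint32_t kBitAvx512DQ = 1u << 17;
constexpr uint32_t kBitAvx512VL = 1u << 31;

// Returns true only if AVX512F, AVX512DQ, and AVX512VL are all reported
// and the OS has enabled AVX-512 state (see the XCR0 test in LoadInfo4).
constexpr bool CanUseAvx512DqNarrow(uint32_t leaf7_ebx, bool os_avx512_state)
{
    return os_avx512_state
        && ((leaf7_ebx & (kBitAvx512F | kBitAvx512DQ | kBitAvx512VL))
            == (kBitAvx512F | kBitAvx512DQ | kBitAvx512VL));
}
```

Comparing the masked value against the full mask, rather than testing for a non-zero result, is what ensures that every required flag is present.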
Non-Temporal Memory Stores 


From the perspective of a memory cache, data can be classified as either temporal or non-temporal. 
Temporal data is any value that is accessed more than once within a short period of time. Examples of 
temporal data include the elements of an array or data structure that are referenced multiple times during 
execution of a program loop. It also includes the instruction bytes of a program. Non-temporal data is any 
value that is accessed once and not immediately reused. The destination arrays of many SIMD processing 
algorithms often contain non-temporal data. The differentiation between temporal and non-temporal data 
is important since processor performance often degrades if its memory caches contain excessive amounts 
of non-temporal data. This condition is commonly called cache pollution. Ideally, a processor's memory 
caches contain only temporal data since it makes little sense to cache items that are only used once. 

Listing 16-2 shows the source code for example Ch16_02. This example illustrates the use of the 
non-temporal store instruction vmovntps. It also compares the performance of this instruction to the 
standard vmovaps instruction. 
Listing 16-2. Example Ch16_02

#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <string>
#include <random>
#include <cmath>
#include "Ch16_02.h"
#include "AlignedMem.h"

using namespace std;

void Init(float* x, size_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {1, 1000};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
        x[i] = (float)ui_dist(rng);
}

bool CalcResultCpp(float* c, const float* a, const float* b, size_t n)
{
    size_t align = 32;

    if ((n == 0) || ((n & 0x0f) != 0))
        return false;

    if (!AlignedMem::IsAligned(a, align))
        return false;
    if (!AlignedMem::IsAligned(b, align))
        return false;
    if (!AlignedMem::IsAligned(c, align))
        return false;

    for (size_t i = 0; i < n; i++)
        c[i] = sqrt(a[i] * a[i] + b[i] * b[i]);

    return true;
}

void CompareResults(const float* c1, const float* c2a, const float* c2b, size_t n)
{
    bool compare_ok = true;
    const float epsilon = 1.0e-9f;

    cout << fixed << setprecision(4);

    for (size_t i = 0; i < n && compare_ok; i++)
    {
        bool b1 = fabs(c1[i] - c2a[i]) > epsilon;
        bool b2 = fabs(c1[i] - c2b[i]) > epsilon;

        cout << setw(2) << i << " - ";
        cout << setw(10) << c1[i] << ' ';
        cout << setw(10) << c2a[i] << ' ';
        cout << setw(10) << c2b[i] << '\n';

        if (b1 || b2)
            compare_ok = false;
    }

    if (compare_ok)
        cout << "Array compare OK\n";
    else
        cout << "Array compare FAILED\n";
}

void NonTemporalStore(void)
{
    const size_t n = 16;
    const size_t align = 32;
    AlignedArray<float> a_aa(n, align);
    AlignedArray<float> b_aa(n, align);
    AlignedArray<float> c1_aa(n, align);
    AlignedArray<float> c2a_aa(n, align);
    AlignedArray<float> c2b_aa(n, align);
    float* a = a_aa.Data();
    float* b = b_aa.Data();
    float* c1 = c1_aa.Data();
    float* c2a = c2a_aa.Data();
    float* c2b = c2b_aa.Data();

    Init(a, n, 67);
    Init(b, n, 79);

    bool rc1 = CalcResultCpp(c1, a, b, n);
    bool rc2 = CalcResultA_(c2a, a, b, n);
    bool rc3 = CalcResultB_(c2b, a, b, n);

    if (!rc1 || !rc2 || !rc3)
    {
        cout << "Invalid return code\n";
        cout << "rc1 = " << boolalpha << rc1 << '\n';
        cout << "rc2 = " << boolalpha << rc2 << '\n';
        cout << "rc3 = " << boolalpha << rc3 << '\n';
        return;
    }

    cout << "Results for NonTemporalStore\n";
    CompareResults(c1, c2a, c2b, n);
}

int main()
{
    NonTemporalStore();
    NonTemporalStore_BM();
    return 0;
}
;-------------------------------------------------
;               Ch16_02.asm
;-------------------------------------------------

; _CalcResult Macro
;
; The following macro contains a simple calculating loop that is used
; to compare performance of the vmovaps and vmovntps instructions.

_CalcResult macro MovInstr

; Load and validate arguments
        xor eax,eax                         ;set error code
        test r9,r9
        jz Done                             ;jump if n == 0
        test r9,0fh
        jnz Done                            ;jump if (n % 16) != 0

        test rcx,1fh
        jnz Done                            ;jump if c is not aligned
        test rdx,1fh
        jnz Done                            ;jump if a is not aligned
        test r8,1fh
        jnz Done                            ;jump if b is not aligned

; Calculate c[i] = sqrt(a[i] * a[i] + b[i] * b[i])
        align 16
@@:     vmovaps ymm0,ymmword ptr [rdx+rax]      ;ymm0 = a[i+7]:a[i]
        vmovaps ymm1,ymmword ptr [r8+rax]       ;ymm1 = b[i+7]:b[i]
        vmulps ymm2,ymm0,ymm0                   ;ymm2 = a[i] * a[i]
        vmulps ymm3,ymm1,ymm1                   ;ymm3 = b[i] * b[i]
        vaddps ymm4,ymm2,ymm3                   ;ymm4 = sum
        vsqrtps ymm5,ymm4                       ;ymm5 = final result
        MovInstr ymmword ptr [rcx+rax],ymm5     ;save final values to c

        vmovaps ymm0,ymmword ptr [rdx+rax+32]   ;ymm0 = a[i+15]:a[i+8]
        vmovaps ymm1,ymmword ptr [r8+rax+32]    ;ymm1 = b[i+15]:b[i+8]
        vmulps ymm2,ymm0,ymm0                   ;ymm2 = a[i] * a[i]
        vmulps ymm3,ymm1,ymm1                   ;ymm3 = b[i] * b[i]
        vaddps ymm4,ymm2,ymm3                   ;ymm4 = sum
        vsqrtps ymm5,ymm4                       ;ymm5 = final result
        MovInstr ymmword ptr [rcx+rax+32],ymm5  ;save final values to c

        add rax,64                          ;update offset
        sub r9,16                           ;update counter
        jnz @B

        mov eax,1                           ;set success return code

Done:   vzeroupper
        ret
        endm

; extern bool CalcResultA_(float* c, const float* a, const float* b, size_t n)

        .code
CalcResultA_ proc
        _CalcResult vmovaps
CalcResultA_ endp

; extern bool CalcResultB_(float* c, const float* a, const float* b, size_t n)

CalcResultB_ proc
        _CalcResult vmovntps
CalcResultB_ endp
        end


Near the top of Listing 16-2 is the C++ function CalcResultCpp. This function performs a simple 
arithmetic calculation using the elements of two single-precision floating-point source arrays. It then 
saves the result to a destination array. The next C++ function in Listing 16-2 is named CompareResults. 
This function verifies equivalence between the C++ and assembly language output arrays. The function 
NonTemporalStore allocates and initializes the test arrays. It then invokes the C++ and assembly language 
calculating functions. The output arrays of the three calculating functions are then compared for any 
discrepancies. 

The assembly language code in Listing 16-2 begins with the definition of a macro named _CalcResult. 
This macro generates AVX instructions that perform the exact same calculation as the C++ function 
CalcResultCpp. The macro _CalcResult is used in the assembly language functions CalcResultA_ and 
CalcResultB_. Note that CalcResultA_ supplies the instruction vmovaps for the macro parameter MovInstr 
while CalcResultB_ provides vmovntps. This means that the code executed by functions CalcResultA_ and 
CalcResultB_ is identical, except for the move instruction that saves results to the destination array. Here is 
the output for source code example Ch16_02: 
Results for NonTemporalStore 
 0 -   240.8319   240.8319   240.8319 
 1 -   747.1814   747.1814   747.1814 
 2 -   285.1561   285.1561   285.1561 
 3 -   862.3062   862.3062   862.3062 
 4 -   604.8810   604.8810   604.8810 
 5 -  1102.4504  1102.4504  1102.4504 
 6 -   347.1441   347.1441   347.1441 
 7 -   471.8315   471.8315   471.8315 
 8 -   890.6739   890.6739   890.6739 
 9 -   729.0878   729.0878   729.0878 
10 -   458.3536   458.3536   458.3536 
11 -   639.8031   639.8031   639.8031 
12 -  1053.1063  1053.1063  1053.1063 
13 -  1016.0079  1016.0079  1016.0079 
14 -   610.4507   610.4507   610.4507 
15 -  1161.7935  1161.7935  1161.7935 
Array compare OK 

Running benchmark function NonTemporalStore_BM - please wait 
Benchmark times save to file Ch16_02_NonTemporalStore_BM_CHROMIUM.csv 


Table 16-2 shows benchmark timing measurements for source code example Ch16_02 using several 
different Intel processors. In this example, using a vmovntps instruction instead of a vmovaps instruction 
yielded notable performance improvements on all three computers. It should be noted that the x86's 
non-temporal move instructions only provide a hint to the processor regarding memory use. They do 
not guarantee improved performance in all cases. Any performance gains are determined by the specific 
memory access pattern and the processor's underlying microarchitecture. 


Table 16-2. Mean Execution Times (Microseconds) for Functions CalcResultCpp, CalcResultA_, and 
CalcResultB_ (n = 2,000,000) 

CPU         CalcResultCpp   CalcResultA_ (uses vmovaps)   CalcResultB_ (uses vmovntps) 
i7-4790s    1553            1554                          1242 
i9-7900x    1173            1139                          934 
i7-8700k    847             801                           590 
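
Non-temporal stores are also available to C++ code through compiler intrinsics. The following minimal sketch performs the same calculation as CalcResultCpp but saves its results with _mm_stream_ps, the 128-bit SSE counterpart (movntps) of the 256-bit vmovntps instruction used in Listing 16-2. The narrower form was chosen here only because it compiles for any x86-64 target without extra compiler options. The function name CalcResultStream is hypothetical, and the sketch assumes that n is a multiple of 4 and that all three arrays are aligned on a 16-byte boundary. A scalar loop is used when the code is compiled for a non-x86-64 target.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// Computes c[i] = sqrt(a[i] * a[i] + b[i] * b[i]) using non-temporal stores.
// Assumptions: n % 4 == 0; a, b, and c are 16-byte aligned.
inline void CalcResultStream(float* c, const float* a, const float* b, std::size_t n)
{
    assert((n % 4) == 0);
    assert(((std::uintptr_t)a % 16) == 0);
    assert(((std::uintptr_t)b % 16) == 0);
    assert(((std::uintptr_t)c % 16) == 0);

#if SKETCH_HAVE_SSE
    for (std::size_t i = 0; i < n; i += 4)
    {
        __m128 va = _mm_load_ps(a + i);
        __m128 vb = _mm_load_ps(b + i);
        __m128 sum = _mm_add_ps(_mm_mul_ps(va, va), _mm_mul_ps(vb, vb));

        // Non-temporal store: hints that c[i..i+3] should bypass the caches
        _mm_stream_ps(c + i, _mm_sqrt_ps(sum));
    }

    // Order the streaming stores before any subsequent stores
    _mm_sfence();
#else
    for (std::size_t i = 0; i < n; i++)
        c[i] = std::sqrt(a[i] * a[i] + b[i] * b[i]);
#endif
}
```

As with the assembly language version, any benefit depends on the memory access pattern and the underlying microarchitecture; for small arrays that are reused immediately, an ordinary store is usually the better choice.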


Data Prefetch 


An application program can also use the prefetch (Prefetch Data Into Caches) instruction to improve 
the performance of certain algorithms. This instruction facilitates pre-loading of expected-use data into 
the processor's cache hierarchy. There are two basic forms of the prefetch instruction. The first form, 
prefetcht[0|1|2], pre-loads temporal data into a specific cache level. The second form, prefetchnta, 
pre-loads non-temporal data while minimizing cache pollution. Both forms of the prefetch instruction provide 
hints to the processor about the data that a program expects to use; a processor may choose to perform the 
prefetch operation or ignore the hint. 
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The prefetch instructions are suitable for use with a variety of data structures, including large arrays and 
linked lists. A linked list is sequentially-ordered collection of nodes. Each node includes a data section and 
one or more pointers (or links) to its adjacent nodes. Figure 16-1 illustrates a simple linked list. Linked lists 
are useful since their size can grow or shrink (i.e., nodes can be added or deleted) depending on data storage 
requirements. One drawback of a linked list is that the nodes are usually not stored in a contiguously- 
allocated block of memory. This tends to increase access times when traversing the nodes a linked list. 

Source code example Ch16 03 illustrates how to perform linked list traversals both with and without the 
prefetchnta instruction. Listings 16-3 shows the C++ and assembly language source code for this example. 


[Figure: linked list diagram with the labels "Node", "Link to Next Node", and "End-of-List Terminator"]

Figure 16-1. Simple linked list


Listing 16-3. Example Ch16_03 


//------------------------------------------------
//               Ch16_03.h
//------------------------------------------------

#pragma once
#include <cstdint>

// This structure must match the corresponding structure definition in Ch16_03_.asmh
struct LlNode
{
    double ValA[4];
    double ValB[4];
    double ValC[4];
    double ValD[4];
    uint8_t FreeSpace[376];
    LlNode* Link;
};

// Ch16_03_Misc.cpp
extern bool LlCompare(int num_nodes, LlNode* l1, LlNode* l2, LlNode* l3, int* node_fail);
extern LlNode* LlCreate(int num_nodes);
extern void LlDelete(LlNode* p);
extern bool LlPrint(LlNode* p, const char* fn, const char* msg, bool append);
extern void LlTraverse(LlNode* p);

// Ch16_03_.asm
extern "C" void LlTraverseA_(LlNode* p);
extern "C" void LlTraverseB_(LlNode* p);

// Ch16_03_BM.cpp
extern void LinkedListPrefetch_BM(void);


//------------------------------------------------
//               Ch16_03.cpp
//------------------------------------------------

#include "stdafx.h"
#include <iostream>
#include <cmath>
#include <random>
#include "Ch16_03.h"
#include "AlignedMem.h"

using namespace std;

void LinkedListPrefetch(void)
{
    const int num_nodes = 8;
    LlNode* list1 = LlCreate(num_nodes);
    LlNode* list2a = LlCreate(num_nodes);
    LlNode* list2b = LlCreate(num_nodes);

    LlTraverse(list1);
    LlTraverseA_(list2a);
    LlTraverseB_(list2b);

    int node_fail;
    const char* fn = "Ch16_03_LinkedListPrefetchResults.txt";

    cout << "Results for LinkedListPrefetch\n";

    if (LlCompare(num_nodes, list1, list2a, list2b, &node_fail))
        cout << "Linked list compare OK\n";
    else
        cout << "Linked list compare FAILED - node_fail = " << node_fail << '\n';

    LlPrint(list1, fn, "----- list1 -----", 0);
    LlPrint(list2a, fn, "----- list2a -----", 1);
    LlPrint(list2b, fn, "----- list2b -----", 1);

    cout << "Linked list results saved to file " << fn << '\n';

    LlDelete(list1);
    LlDelete(list2a);
    LlDelete(list2b);
}

int main()
{
    LinkedListPrefetch();
    LinkedListPrefetch_BM();
    return 0;
}


//------------------------------------------------
//               Ch16_03_Misc.cpp
//------------------------------------------------

#include "stdafx.h"
#include <iostream>
#include <cmath>
#include <random>
#include "Ch16_03.h"
#include "AlignedMem.h"

using namespace std;

bool LlCompare(int num_nodes, LlNode* l1, LlNode* l2, LlNode* l3, int* node_fail)
{
    const double epsilon = 1.0e-9;

    for (int i = 0; i < num_nodes; i++)
    {
        *node_fail = i;

        if ((l1 == nullptr) || (l2 == nullptr) || (l3 == nullptr))
            return false;

        for (int j = 0; j < 4; j++)
        {
            bool b12_c = fabs(l1->ValC[j] - l2->ValC[j]) > epsilon;
            bool b13_c = fabs(l1->ValC[j] - l3->ValC[j]) > epsilon;
            if (b12_c || b13_c)
                return false;

            bool b12_d = fabs(l1->ValD[j] - l2->ValD[j]) > epsilon;
            bool b13_d = fabs(l1->ValD[j] - l3->ValD[j]) > epsilon;
            if (b12_d || b13_d)
                return false;
        }

        l1 = l1->Link;
        l2 = l2->Link;
        l3 = l3->Link;
    }

    *node_fail = -2;
    if ((l1 != nullptr) || (l2 != nullptr) || (l3 != nullptr))
        return false;

    *node_fail = -1;
    return true;
}

LlNode* LlCreate(int num_nodes)
{
    const size_t align = 64;
    const unsigned int seed = 83;
    LlNode* first = nullptr;
    LlNode* last = nullptr;
    uniform_int_distribution<> ui_dist {1, 500};
    default_random_engine rng {seed};

    for (int i = 0; i < num_nodes; i++)
    {
        LlNode* p = (LlNode*)AlignedMem::Allocate(sizeof(LlNode), align);
        p->Link = nullptr;

        if (i == 0)
            first = last = p;
        else
        {
            last->Link = p;
            last = p;
        }

        for (int j = 0; j < 4; j++)
        {
            p->ValA[j] = (double)ui_dist(rng);
            p->ValB[j] = (double)ui_dist(rng);
            p->ValC[j] = 0;
            p->ValD[j] = 0;
        }
    }

    return first;
}

void LlDelete(LlNode* p)
{
    while (p != nullptr)
    {
        LlNode* q = p->Link;
        AlignedMem::Release(p);
        p = q;
    }
}

bool LlPrint(LlNode* p, const char* fn, const char* msg, bool append)
{
    FILE* fp;
    const char* mode = (append) ? "at" : "wt";

    if (fopen_s(&fp, fn, mode) != 0)
        return false;

    int i = 0;
    const char* fs = "%14.4lf %14.4lf %14.4lf %14.4lf\n";

    if (msg != nullptr)
        fprintf(fp, "\n%s\n", msg);

    while (p != nullptr)
    {
        fprintf(fp, "\nLlNode %d [0x%p]\n", i, p);
        fprintf(fp, "  ValA: ");
        fprintf(fp, fs, p->ValA[0], p->ValA[1], p->ValA[2], p->ValA[3]);
        fprintf(fp, "  ValB: ");
        fprintf(fp, fs, p->ValB[0], p->ValB[1], p->ValB[2], p->ValB[3]);
        fprintf(fp, "  ValC: ");
        fprintf(fp, fs, p->ValC[0], p->ValC[1], p->ValC[2], p->ValC[3]);
        fprintf(fp, "  ValD: ");
        fprintf(fp, fs, p->ValD[0], p->ValD[1], p->ValD[2], p->ValD[3]);
        i++;
        p = p->Link;
    }

    fclose(fp);
    return true;
}

void LlTraverse(LlNode* p)
{
    while (p != nullptr)
    {
        for (int i = 0; i < 4; i++)
        {
            p->ValC[i] = sqrt(p->ValA[i] * p->ValA[i] + p->ValB[i] * p->ValB[i]);
            p->ValD[i] = sqrt(p->ValA[i] / p->ValB[i] + p->ValB[i] / p->ValA[i]);
        }

        p = p->Link;
    }
}

;-------------------------------------------------
;               Ch16_03_.asmh
;-------------------------------------------------

; This structure must match the corresponding structure definition in Ch16_03.h

LlNode      struct
ValA        real8 4 dup(?)
ValB        real8 4 dup(?)
ValC        real8 4 dup(?)
ValD        real8 4 dup(?)
FreeSpace   byte 376 dup(?)
Link        qword ?
LlNode      ends

;-------------------------------------------------
;               Ch16_03_.asm
;-------------------------------------------------

        include <Ch16_03_.asmh>

; Macro _LlTraverse
;
; The following macro generates linked list traversal code using the
; prefetchnta instruction if UsePrefetch is equal to 'Y'.

_LlTraverse macro UsePrefetch
        mov rax,rcx                                 ;rax = ptr to 1st node
        test rax,rax
        jz Done                                     ;jump if empty list

        align 16
@@:     mov rcx,[rax+LlNode.Link]                   ;rcx = next node
        vmovapd ymm0,ymmword ptr [rax+LlNode.ValA]  ;ymm0 = ValA
        vmovapd ymm1,ymmword ptr [rax+LlNode.ValB]  ;ymm1 = ValB

IFIDNI <UsePrefetch>,<Y>
        mov rdx,rcx
        test rdx,rdx                                ;is there another node?
        cmovz rdx,rax                               ;avoid prefetch of nullptr
        prefetchnta [rdx]                           ;prefetch start of next node
ENDIF

; Calculate ValC[i] = sqrt(ValA[i] * ValA[i] + ValB[i] * ValB[i])
        vmulpd ymm2,ymm0,ymm0                       ;ymm2 = ValA * ValA
        vmulpd ymm3,ymm1,ymm1                       ;ymm3 = ValB * ValB
        vaddpd ymm4,ymm2,ymm3                       ;ymm4 = sums
        vsqrtpd ymm5,ymm4                           ;ymm5 = square roots
        vmovntpd ymmword ptr [rax+LlNode.ValC],ymm5 ;save result

; Calculate ValD[i] = sqrt(ValA[i] / ValB[i] + ValB[i] / ValA[i]);
        vdivpd ymm2,ymm0,ymm1                       ;ymm2 = ValA / ValB
        vdivpd ymm3,ymm1,ymm0                       ;ymm3 = ValB / ValA
        vaddpd ymm4,ymm2,ymm3                       ;ymm4 = sums
        vsqrtpd ymm5,ymm4                           ;ymm5 = square roots
        vmovntpd ymmword ptr [rax+LlNode.ValD],ymm5 ;save result

        mov rax,rcx                                 ;rax = ptr to next node
        test rax,rax
        jnz @B

Done:   vzeroupper
        ret
        endm

; extern "C" void LlTraverseA_(LlNode* first);

        .code
LlTraverseA_ proc
        _LlTraverse n
LlTraverseA_ endp

; extern "C" void LlTraverseB_(LlNode* first);

LlTraverseB_ proc
        _LlTraverse y
LlTraverseB_ endp
        end


Listing 16-3 begins with the header file Ch16_03.h. The declaration of structure LlNode is located near
the top of this file. The C++ code uses this structure to construct linked lists of test data. Structure members
ValA through ValD hold the data values that are manipulated by the linked list traversal functions. The
member FreeSpace is included to increase the size of LlNode for demonstration purposes since prefetching
works best with larger data structures. A real-world implementation of LlNode could use this space for
additional data items. The final member of LlNode is a pointer named Link, which points to the next LlNode
structure. The assembly language counterpart of LlNode is declared in the file Ch16_03_.asmh.
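
Because the C++ and assembly language declarations of LlNode must agree, it can be useful to have the compiler
check the layout. The following sketch is not part of the book's source code; it assumes a 64-bit build and
assumes that the MASM structure places its members back to back (four 32-byte arrays, 376 bytes of free
space, and an 8-byte pointer).

```cpp
#include <cstddef>
#include <cstdint>

// Mirror of the LlNode declaration from Listing 16-3.
struct LlNode
{
    double ValA[4];
    double ValB[4];
    double ValC[4];
    double ValD[4];
    std::uint8_t FreeSpace[376];
    LlNode* Link;
};

// Compile-time checks of the offsets that the assembly language code
// is assumed to use (e.g., [rax+LlNode.Link]).
static_assert(sizeof(void*) == 8, "sketch assumes a 64-bit build");
static_assert(offsetof(LlNode, ValA) == 0, "unexpected ValA offset");
static_assert(offsetof(LlNode, ValB) == 32, "unexpected ValB offset");
static_assert(offsetof(LlNode, ValC) == 64, "unexpected ValC offset");
static_assert(offsetof(LlNode, ValD) == 96, "unexpected ValD offset");
static_assert(offsetof(LlNode, FreeSpace) == 128, "unexpected FreeSpace offset");
static_assert(offsetof(LlNode, Link) == 504, "unexpected Link offset");
static_assert(sizeof(LlNode) == 512, "unexpected LlNode size");
```

With these sizes each node occupies 512 bytes, which is a whole number of 64-byte blocks and matches the
64-byte alignment that LlCreate requests for every node.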

The base function for source code example Ch16_03 is named LinkedListPrefetch and can be found
in source code file Ch16_03.cpp. This function builds several test linked lists, invokes the C++ and assembly
language traversal functions, and validates the results. The source code file Ch16_03_Misc.cpp contains a set
of miscellaneous functions that implement basic linked list processing operations. The function LlCompare
compares the data nodes of its argument linked lists for equivalence. The functions LlCreate and LlDelete
perform linked list allocation and deletion. LlPrint dumps the data contents of a linked list to a file. Finally,
LlTraverse traverses a linked list and performs a simulated calculation using LlNode data elements ValA,
ValB, ValC, and ValD.

Toward the top of the file Ch16_03_.asm is a macro named _LlTraverse. This macro generates code
that performs the same linked list traversal and simulated calculation as the C++ function LlTraverse.
The macro _LlTraverse requires one parameter that enables or disables data prefetching. If prefetching
is enabled, the macro generates a short block of code that includes the instruction prefetchnta [rdx].
This instruction directs the processor to prefetch the non-temporal data pointed to by register RDX. In
this example, RDX points to the next LlNode in the linked list. The actual number of bytes fetched by this
instruction varies depending on the underlying microarchitecture; Intel processors fetch a minimum of 32
bytes. It is extremely important to note that the prefetchnta [rdx] instruction is positioned before the
floating-point calculating instructions. Doing this gives the processor an opportunity to fetch the data for
the next node while at the same time carrying out the arithmetic calculations for the current node. Also note
that prior to execution of the prefetchnta [rdx] instruction, RDX is tested to avoid using prefetchnta with a
nullptr (or zero) memory address since this degrades processor performance; if the link is zero, the cmovz
instruction substitutes the address of the current node. This is important since nullptr is used as the
end-of-list terminator value. The assembly language functions LlTraverseA_ and LlTraverseB_ use macro
_LlTraverse to perform linked list traversals both without and with prefetching, respectively.
Here is the output for source code example Ch16_03.


Results for LinkedListPrefetch
Linked list compare OK
Linked list results saved to file Ch16_03_LinkedListPrefetchResults.txt

Running benchmark function LinkedListPrefetch_BM - please wait
Benchmark times save to file Ch16_03_LinkedListPrefetch_BM_CHROMIUM.csv


Table 16-3 shows benchmark timing measurements for several Intel processors. It is important to
keep in mind that any performance benefits provided by the prefetch instructions are highly dependent on
current processor load, data access patterns, and the underlying microarchitecture. According to the Intel 64
and IA-32 Architectures Optimization Reference Manual, the data prefetch instructions are "implementation
specific." This means that to maximize prefetch performance, an algorithm must be "tuned to each
implementation" or microarchitecture. You are encouraged to consult the aforementioned reference manual
for additional information regarding use of the x86's data prefetch instructions.


Table 16-3. Mean Execution Times (Microseconds) for Linked List Traversal Functions (num_nodes = 50,000)

CPU         LlTraverse (C++)    LlTraverseA_ (without prefetchnta)    LlTraverseB_ (with prefetchnta)
i7-4790s    5685                3093                                  2680
i9-7900x    5885                3064                                  2842
i7-8700k    5031                2384                                  2319
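
The prefetch placement used by the _LlTraverse macro can also be sketched in C++. The code below is only an
illustration under stated assumptions: Node is a hypothetical, smaller stand-in for LlNode, and the macro
SKETCH_PREFETCH_NTA wraps the GCC/Clang builtin __builtin_prefetch (with locality 0, which requests a
non-temporal prefetch) and expands to a no-op on other compilers.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
// Arguments: address, rw (0 = read), locality (0 = no temporal locality)
#define SKETCH_PREFETCH_NTA(addr) __builtin_prefetch((addr), 0, 0)
#else
#define SKETCH_PREFETCH_NTA(addr) ((void)(addr))    // no-op where the builtin is unavailable
#endif

struct Node     // hypothetical stand-in for LlNode
{
    double ValA[4];
    double ValB[4];
    double ValC[4];
    Node* Link;
};

inline void TraverseWithPrefetch(Node* p)
{
    while (p != nullptr)
    {
        Node* next = p->Link;
        assert(next != p);      // sketch assumes a list without self-links

        // Issue the hint before the arithmetic so the fetch can overlap it.
        // As in the macro, use the current node's address instead of a
        // null address when the end of the list has been reached.
        SKETCH_PREFETCH_NTA(next != nullptr ? next : p);

        for (int i = 0; i < 4; i++)
            p->ValC[i] = std::sqrt(p->ValA[i] * p->ValA[i] + p->ValB[i] * p->ValB[i]);

        p = next;
    }
}
```

Whether the hint helps in a C++ build depends on the same factors noted above: processor load, data access
patterns, and the underlying microarchitecture.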


Multiple Threads 


All the source code examples presented in this book thus far have shared one common characteristic:
they all contain single-threaded code. The mere fact that you are reading this book probably means that
you already know that most modern software applications utilize at least a few threads to better exploit
the multiple cores of modern processors. For example, many high-performance computing applications
frequently perform arithmetic calculations using large data arrays that contain millions of floating-point
elements. One strategy that's often employed to accelerate the performance of these types of calculations
is to distribute the array elements across multiple threads and have each thread carry out a subset of
the required calculations. The next source code example demonstrates how to perform an arithmetic
calculation using large floating-point arrays and multiple threads. Listing 16-4 shows the source code for
example Ch16_04.

Caution A processor can become extremely hot while executing multithreaded code that makes extensive
use of x86-AVX instructions. Before running the code of example Ch16_04, you should verify that the processor
in your computer has an adequate cooling system.


Listing 16-4. Example Ch16_04 


//------------------------------------------------
//               Ch16_04.h
//------------------------------------------------

#pragma once
#include <vector>

struct CalcInfo
{
    double* m_X1;
    double* m_X2;
    double* m_Y1;
    double* m_Y2;
    double* m_Z1;
    double* m_Z2;
    double* m_Result;
    size_t m_Index0;
    size_t m_Index1;
    int m_Status;
};

struct CoutInfo
{
    bool m_ThreadMsgEnable;
    size_t m_Iteration;
    size_t m_NumElements;
    size_t m_ThreadId;
    size_t m_NumThreads;
};

// Ch16_04_Misc.cpp
extern size_t CompareResults(const double* a, const double* b, size_t n);
extern void DisplayThreadMsg(const CalcInfo* ci, const CoutInfo* cout_info, const char* msg);
extern void Init(double* a1, double* a2, size_t n, unsigned int seed);
std::vector<size_t> GetNumElementsVec(size_t* num_elements_max);
std::vector<size_t> GetNumThreadsVec(void);

// Ch16_04_WinApi.cpp
extern bool GetAvailableMemory(size_t* mem_size);

// Ch16_04_.asm
extern "C" void CalcResult_(CalcInfo* ci);

// Miscellaneous constants
const size_t c_ElementSize = sizeof(double);
const size_t c_NumArrays = 8;   // Total number of allocated arrays
const size_t c_Align = 32;      // Alignment boundary (update Ch16_04_.asm if changed)
const size_t c_BlockSize = 8;   // Elements per iteration (update Ch16_04_.asm if changed)


//------------------------------------------------
//               Ch16_04_Misc.cpp
//------------------------------------------------

#include "stdafx.h"
#include <iostream>
#include <random>
#include <memory.h>
#include <cmath>
#include <mutex>
#include <vector>
#include <algorithm>
#include "Ch16_04.h"

using namespace std;

void Init(double* a1, double* a2, size_t n, unsigned int seed)
{
    uniform_int_distribution<> ui_dist {1, 2000};
    default_random_engine rng {seed};

    for (size_t i = 0; i < n; i++)
    {
        a1[i] = (double)ui_dist(rng);
        a2[i] = (double)ui_dist(rng);
    }
}

size_t CompareResults(const double* a, const double* b, size_t n)
{
    if (memcmp(a, b, n * sizeof(double)) == 0)
        return n;

    const double epsilon = 1.0e-15;

    for (size_t i = 0; i < n; i++)
    {
        if (fabs(a[i] - b[i]) > epsilon)
            return i;
    }

    return n;
}

void DisplayThreadMsg(const CalcInfo* ci, const CoutInfo* cout_info, const char* msg)
{
    static mutex mutex_cout;
    static const char nl = '\n';

    mutex_cout.lock();
    cout << nl << msg << nl;
    cout << "  m_Iteration:   " << cout_info->m_Iteration << nl;
    cout << "  m_NumElements: " << cout_info->m_NumElements << nl;
    cout << "  m_ThreadId:    " << cout_info->m_ThreadId << nl;
    cout << "  m_NumThreads:  " << cout_info->m_NumThreads << nl;
    cout << "  m_Index0:      " << ci->m_Index0 << nl;
    cout << "  m_Index1:      " << ci->m_Index1 << nl;
    mutex_cout.unlock();
}

vector<size_t> GetNumElementsVec(size_t* num_elements_max)
{
//  vector<size_t> ne_vec {64, 192, 384, 512};  // Requires 32GB + extra
    vector<size_t> ne_vec {64, 128, 192, 256};  // Requires 16GB + extra
//  vector<size_t> ne_vec {64, 96, 128, 160};   // Requires 10GB + extra

    size_t mem_size_extra_gb = 2;   // Used to avoid allocating all available mem
    size_t ne_max = *std::max_element(ne_vec.begin(), ne_vec.end());

    if ((ne_max % c_BlockSize) != 0)
        throw runtime_error("ne_max must be an integer multiple of c_BlockSize");

    size_t mem_size;

    if (!GetAvailableMemory(&mem_size))
        throw runtime_error("GetAvailableMemory failed");

    size_t mem_size_gb = mem_size / (1024 * 1024 * 1024);
    size_t mem_size_min = ne_max * 1024 * 1024 * c_ElementSize * c_NumArrays;
    size_t mem_size_min_gb = mem_size_min / (1024 * 1024 * 1024);

    if (mem_size_gb < mem_size_min_gb + mem_size_extra_gb)
        throw runtime_error("Not enough available memory");

    *num_elements_max = ne_max * 1024 * 1024;
    return ne_vec;
}

vector<size_t> GetNumThreadsVec(void)
{
    vector<size_t> num_threads_vec {1, 2, 4, 6, 8};
    return num_threads_vec;
}

//------------------------------------------------
//               Ch16_04.cpp
//------------------------------------------------

#include "stdafx.h"
#include <iostream>
#include <iomanip>
#include <sstream>
#include <stdexcept>
#include <thread>
#include <vector>
#include "Ch16_04.h"
#include "AlignedMem.h"
#include "BmThreadTimer.h"

using namespace std;

// Control flag for streaming thread status information to cout.
const bool c_ThreadMsgEnable = false;

void CalcResultCpp(CalcInfo* ci)
{
    size_t al = c_Align;
    size_t i0 = ci->m_Index0;
    size_t i1 = ci->m_Index1;
    size_t num_elements = i1 - i0 + 1;

    ci->m_Status = 0;

    if (num_elements == 0 || (num_elements % c_BlockSize) != 0)
        return;

    for (size_t i = i0; i <= i1; i++)
    {
        double xx = ci->m_X1[i] - ci->m_X2[i];
        double yy = ci->m_Y1[i] - ci->m_Y2[i];
        double zz = ci->m_Z1[i] - ci->m_Z2[i];

        ci->m_Result[i] = sqrt(1.0 / sqrt(xx * xx + yy * yy + zz * zz));
    }

    ci->m_Status = 1;
}

static void CalcResultThread(CalcInfo* ci, CoutInfo* cout_info)
{
    if (cout_info->m_ThreadMsgEnable)
        DisplayThreadMsg(ci, cout_info, "ENTER CalcResultThread()");

    CalcResult_(ci);

    if (cout_info->m_ThreadMsgEnable)
        DisplayThreadMsg(ci, cout_info, "EXIT CalcResultThread()");
}

void RunMultipleThreads(bool thread_msg_enable)
{
    // Code section #1
    size_t align = c_Align;
    size_t num_elements_max;
    vector<size_t> num_elements_vec = GetNumElementsVec(&num_elements_max);
    vector<size_t> num_threads_vec = GetNumThreadsVec();

    AlignedArray<double> x1_aa(num_elements_max, align);
    AlignedArray<double> x2_aa(num_elements_max, align);
    AlignedArray<double> y1_aa(num_elements_max, align);
    AlignedArray<double> y2_aa(num_elements_max, align);
    AlignedArray<double> z1_aa(num_elements_max, align);
    AlignedArray<double> z2_aa(num_elements_max, align);
    AlignedArray<double> result1_aa(num_elements_max, align);
    AlignedArray<double> result2_aa(num_elements_max, align);

    double* x1 = x1_aa.Data();
    double* x2 = x2_aa.Data();
    double* y1 = y1_aa.Data();
    double* y2 = y2_aa.Data();
    double* z1 = z1_aa.Data();
    double* z2 = z2_aa.Data();
    double* result1 = result1_aa.Data();
    double* result2 = result2_aa.Data();

    cout << "Begin initialization of test arrays\n";
    cout << "  Initializing test arrays x1, x2\n";
    Init(x1, x2, num_elements_max, 307);
    cout << "  Initializing test arrays y1, y2\n";
    Init(y1, y2, num_elements_max, 401);
    cout << "  Initializing test arrays z1, z2\n";
    Init(z1, z2, num_elements_max, 503);
    cout << "Finished initialization of test arrays\n";

    CalcInfo ci1;
    ci1.m_X1 = x1;  ci1.m_X2 = x2;
    ci1.m_Y1 = y1;  ci1.m_Y2 = y2;
    ci1.m_Z1 = z1;  ci1.m_Z2 = z2;
    ci1.m_Result = result1;
    ci1.m_Index0 = 0;
    ci1.m_Index1 = num_elements_max - 1;
    ci1.m_Status = -1;

    // CalcResultCpp used for verification purposes
    cout << "Begin execution of CalcResultCpp\n";
    CalcResultCpp(&ci1);
    cout << "Finished execution of CalcResultCpp\n";

    size_t iteration = 0;
    const size_t block_size = c_BlockSize;
    BmThreadTimer bmtt(num_elements_vec.size(), num_threads_vec.size());

    // Code section #2
    cout << "Begin execution of calculating threads\n";

    for (size_t i = 0; i < num_elements_vec.size(); i++)
    {
        size_t num_elements = num_elements_vec[i] * 1024 * 1024;
        size_t num_blocks = num_elements / block_size;
        size_t num_blocks_rem = num_elements % block_size;

        if (num_blocks_rem != 0)
            throw runtime_error("num_elements must be an integer multiple of block_size");

        for (size_t j = 0; j < num_threads_vec.size(); j++)
        {
            size_t num_threads = num_threads_vec[j];

            bmtt.Start(i, j);

            size_t num_blocks_per_thread = num_blocks / num_threads;
            size_t num_blocks_per_thread_rem = num_blocks % num_threads;

            vector<CalcInfo> ci2(num_threads);
            vector<CoutInfo> cout_info(num_threads);
            vector<thread*> threads(num_threads);

            // Thread start code
            for (size_t k = 0; k < num_threads; k++)
            {
                ci2[k].m_X1 = x1;  ci2[k].m_X2 = x2;
                ci2[k].m_Y1 = y1;  ci2[k].m_Y2 = y2;
                ci2[k].m_Z1 = z1;  ci2[k].m_Z2 = z2;
                ci2[k].m_Result = result2;
                ci2[k].m_Index0 = k * num_blocks_per_thread * block_size;
                ci2[k].m_Index1 = (k + 1) * num_blocks_per_thread * block_size - 1;
                ci2[k].m_Status = -1;

                if ((k + 1) == num_threads)
                    ci2[k].m_Index1 += num_blocks_per_thread_rem * block_size;

                cout_info[k].m_ThreadMsgEnable = thread_msg_enable;
                cout_info[k].m_Iteration = iteration;
                cout_info[k].m_NumElements = num_elements;
                cout_info[k].m_NumThreads = num_threads;
                cout_info[k].m_ThreadId = k;

                threads[k] = new thread(CalcResultThread, &ci2[k], &cout_info[k]);
            }

            // Wait for all threads to complete
            for (size_t k = 0; k < num_threads; k++)
                threads[k]->join();

            bmtt.Stop(i, j);

            size_t cmp_index = CompareResults(result1, result2, num_elements);

            if (cmp_index != num_elements)
            {
                ostringstream oss;
                oss << "  compare error detected at index " << cmp_index;
                throw runtime_error(oss.str());
            }

            for (size_t k = 0; k < num_threads; k++)
            {
                if (ci2[k].m_Status != 1)
                {
                    ostringstream oss;
                    oss << "  invalid status code " << ci2[k].m_Status;
                    throw runtime_error(oss.str());
                }

                delete threads[k];
            }
        }

        iteration++;
    }

    cout << "Finished execution of calculating threads\n";

    string fn = bmtt.BuildCsvFilenameString("Ch16_04_MultipleThreads_BM");
    bmtt.SaveElapsedTimes(fn, BmThreadTimer::EtUnit::MilliSec, 0);
    cout << "Benchmark times save to file " << fn << '\n';
}

int main()
{
    try
    {
        RunMultipleThreads(c_ThreadMsgEnable);
    }

    catch (runtime_error& rte)
    {
        cout << "'runtime_error' exception has occurred - " << rte.what() << '\n';
    }

    catch (...)
    {
        cout << "Unexpected exception has occurred\n";
    }

    return 0;
}

;-------------------------------------------------
;               Ch16_04_.asm
;-------------------------------------------------

        include <MacrosX86-64-AVX.asmh>

CalcInfo struct
X1      qword ?
X2      qword ?
Y1      qword ?
Y2      qword ?
Z1      qword ?
Z2      qword ?
Result  qword ?
Index0  qword ?
Index1  qword ?
Status  dword ?
CalcInfo ends

        .const
r8_1p0  real8 1.0

; extern "C" void CalcResult_(CalcInfo* ci)

        .code
CalcResult_ proc frame
        _CreateFrame CR,0,16,r12,r13,r14,r15
        _SaveXmmRegs xmm6
        _EndProlog

        mov dword ptr [rcx+CalcInfo.Status],0

; Make sure num_elements is valid
        mov rax,[rcx+CalcInfo.Index0]           ;rax = start index
        mov rdx,[rcx+CalcInfo.Index1]           ;rdx = stop index
        sub rdx,rax
        add rdx,1                               ;rdx = num_elements
        test rdx,rdx
        jz Done                                 ;jump if num_elements == 0
        test rdx,7
        jnz Done                                ;jump if num_elements % 8 != 0

; Make sure all arrays are properly aligned
        mov r8d,1fh
        mov r9,[rcx+CalcInfo.Result]
        test r9,r8
        jnz Done
        mov r10,[rcx+CalcInfo.X1]
        test r10,r8
        jnz Done
        mov r11,[rcx+CalcInfo.X2]
        test r11,r8
        jnz Done
        mov r12,[rcx+CalcInfo.Y1]
        test r12,r8
        jnz Done
        mov r13,[rcx+CalcInfo.Y2]
        test r13,r8
        jnz Done
        mov r14,[rcx+CalcInfo.Z1]
        test r14,r8
        jnz Done
        mov r15,[rcx+CalcInfo.Z2]
        test r15,r8
        jnz Done

        vbroadcastsd ymm6,real8 ptr [r8_1p0]    ;ymm6 = packed 1.0 (DPFP)

; Perform simulated calculation
        align 16
LP1:    vmovapd ymm0,ymmword ptr [r10+rax*8]
        vmovapd ymm1,ymmword ptr [r12+rax*8]
        vmovapd ymm2,ymmword ptr [r14+rax*8]
        vsubpd ymm0,ymm0,ymmword ptr [r11+rax*8]
        vsubpd ymm1,ymm1,ymmword ptr [r13+rax*8]
        vsubpd ymm2,ymm2,ymmword ptr [r15+rax*8]
        vmulpd ymm3,ymm0,ymm0
        vmulpd ymm4,ymm1,ymm1
        vmulpd ymm5,ymm2,ymm2
        vaddpd ymm0,ymm3,ymm4
        vaddpd ymm1,ymm0,ymm5
        vsqrtpd ymm2,ymm1
        vdivpd ymm3,ymm6,ymm2
        vsqrtpd ymm4,ymm3
        vmovntpd ymmword ptr [r9+rax*8],ymm4
        add rax,4

        vmovapd ymm0,ymmword ptr [r10+rax*8]
        vmovapd ymm1,ymmword ptr [r12+rax*8]
        vmovapd ymm2,ymmword ptr [r14+rax*8]
        vsubpd ymm0,ymm0,ymmword ptr [r11+rax*8]
        vsubpd ymm1,ymm1,ymmword ptr [r13+rax*8]
        vsubpd ymm2,ymm2,ymmword ptr [r15+rax*8]
        vmulpd ymm3,ymm0,ymm0
        vmulpd ymm4,ymm1,ymm1
        vmulpd ymm5,ymm2,ymm2
        vaddpd ymm0,ymm3,ymm4
        vaddpd ymm1,ymm0,ymm5
        vsqrtpd ymm2,ymm1
        vdivpd ymm3,ymm6,ymm2
        vsqrtpd ymm4,ymm3
        vmovntpd ymmword ptr [r9+rax*8],ymm4
        add rax,4

        sub rdx,8
        jnz LP1

        mov dword ptr [rcx+CalcInfo.Status],1

Done:   vzeroupper
        _RestoreXmmRegs xmm6
        _DeleteFrame r12,r13,r14,r15
        ret
CalcResult_ endp
        end


Source code example Ch16_04 performs a simulated calculation using large arrays of double-precision
floating-point values across multiple threads. It uses the C++ STL class thread to run multiple instances of
an AVX2 assembly language calculating function. Each thread performs its calculations using only a portion
of the array data. The C++ driver routine exercises the calculating algorithm using various combinations of
array sizes and simultaneously executing threads. It also implements benchmark timing measurements to
quantify the performance benefits of the multithreaded technique.

Listing 16-4 begins with the header file Ch16_04.h. In this file, the structure CalcInfo contains the data that
each thread needs to carry out its calculations. The structure members m_X1, m_X2, m_Y1, m_Y2, m_Z1, and
m_Z2 point to the source arrays, while m_Result points to the destination array. Members m_Index0 and
m_Index1 are array indices that define a range of unique elements for each calculating thread. Header file
Ch16_04.h also includes a structure named CoutInfo, which contains status information that's optionally
displayed during program execution.

The next file in Listing 16-4 is Ch16_04_Misc.cpp. This file contains the source code for the program's
ancillary functions. The functions Init and CompareResults perform array initialization and verification,
respectively. The next function is DisplayThreadMsg. This function displays status information for each
executing thread. Note that the cout statements in DisplayThreadMsg are synchronized with a C++ STL mutex.
A mutex is a synchronization object that facilitates controlled access to a single resource by multiple threads.
When mutex_cout is locked, only one thread can stream its status results to cout. The other executing threads
are blocked from streaming their results to cout until mutex_cout becomes unlocked. Without this mutex,
status information text from the executing threads would be intermingled on the display (if you're interested
in seeing what happens, try commenting out the mutex_cout.lock and mutex_cout.unlock statements).
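
The locking pattern used by DisplayThreadMsg can be sketched as follows. This is an illustrative variant and
not the book's code: the function names WriteThreadMsg and RunMsgThreads are hypothetical, the output goes
to a shared ostringstream instead of cout so that it can be inspected, and std::lock_guard is used in place
of explicit lock and unlock calls.

```cpp
#include <cstddef>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for DisplayThreadMsg: the three stream insertions are
// grouped under one lock so that a message is never split by another thread.
inline void WriteThreadMsg(std::ostream& os, std::mutex& mtx, std::size_t thread_id)
{
    std::lock_guard<std::mutex> lock(mtx);  // unlocks automatically at scope exit
    os << "thread ";
    os << thread_id;
    os << " done\n";
}

// Launches num_threads threads that all write to the same stream and
// returns the combined text after every thread has finished.
inline std::string RunMsgThreads(std::size_t num_threads)
{
    std::ostringstream oss;                 // shared stream, standing in for cout
    std::mutex mtx;
    std::vector<std::thread> threads;

    for (std::size_t k = 0; k < num_threads; k++)
        threads.emplace_back([&oss, &mtx, k] { WriteThreadMsg(oss, mtx, k); });

    for (auto& t : threads)
        t.join();

    return oss.str();
}
```

The order of the lines in the returned string is not deterministic, but each line is written as a unit.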

The function GetNumElementsVec returns a vector that contains the sizes of the test arrays. Note that
the amount of memory required by the test arrays at their largest size, plus a small fudge factor, must not
exceed the amount of available memory. The fudge factor prevents the program from allocating all available
memory. Also note that GetNumElementsVec throws an exception if insufficient memory is available since running
the program in this manner is very slow due to the amount of page swapping that occurs. The function
GetNumThreadsVec returns a vector of test thread counts. You can change the values in num_threads_vec to
experiment with different thread count values.
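
The memory requirement that GetNumElementsVec checks is simple arithmetic, sketched below. The helper names
RequiredMemGb and EnoughMemory are hypothetical; the constants mirror c_ElementSize and c_NumArrays from
Ch16_04.h, and the default fudge factor of 2 GB mirrors mem_size_extra_gb.

```cpp
#include <cassert>
#include <cstdint>

const std::uint64_t kElementSize = sizeof(double);  // mirrors c_ElementSize
const std::uint64_t kNumArrays = 8;                 // mirrors c_NumArrays

// ne_max is the largest test array size in units of 1024 * 1024 elements.
// Returns the memory needed by all arrays at that size, in GB.
inline std::uint64_t RequiredMemGb(std::uint64_t ne_max)
{
    assert(ne_max > 0);
    const std::uint64_t bytes = ne_max * 1024 * 1024 * kElementSize * kNumArrays;
    return bytes / (1024ull * 1024 * 1024);
}

// True if the available memory covers the arrays plus the fudge factor.
inline bool EnoughMemory(std::uint64_t avail_gb, std::uint64_t ne_max, std::uint64_t extra_gb = 2)
{
    return avail_gb >= RequiredMemGb(ne_max) + extra_gb;
}
```

For the three ne_vec alternatives in the listing, whose largest values are 512, 256, and 160, this yields 32 GB,
16 GB, and 10 GB, which agrees with the comments next to each alternative.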

The source code file Ch16_04.cpp contains the driver routines for source code example Ch16_04.
The function CalcResultCpp is a C++ implementation of the simulated calculating algorithm and is used
for result verification purposes. The next function, CalcResultThread, is the main thread function. This
function invokes the assembly language calculating function CalcResult_. It also displays thread status
messages if they're enabled.

Following CalcResultThread is the function RunMultipleThreads. This function exercises the
calculating algorithm using the specified combinations of array sizes and number of simultaneously
executing threads. The first code section of RunMultipleThreads performs array allocation and element
initialization. It also calls the function CalcResultCpp to calculate the result values for algorithm verification
purposes. Note that prior to calling this function, an instance of CalcInfo is initialized with the data that's
necessary to carry out the required calculations.

The second code section of RunMultipleThreads, which starts immediately after the comment line Code
section #2, runs the calculating algorithm by distributing the test array elements across multiple threads.
The test array elements are split into groups based on the number of threads that will execute. For example,
if the number of test array elements equals 64 million, launching four threads will result in each thread
processing 16 million elements. The innermost for loop that follows the comment line Thread start code
begins each iteration by initializing an instance of CalcInfo for the next thread. The statement threads[k]
= new thread(CalcResultThread, &ci2[k], &cout_info[k]) constructs a new thread object and starts
execution of the thread function CalcResultThread using argument values &ci2[k] and &cout_info[k].
This inner for loop repeats until the required number of executing threads have been launched. While the
threads are executing, the function RunMultipleThreads executes a small for loop that invokes the function
thread::join. This effectively forces RunMultipleThreads to wait until all executing threads have finished.
The remaining code in RunMultipleThreads performs data verification and object cleanup.
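
The index arithmetic that assigns a range of elements to each thread can be isolated in a small function. The
sketch below follows the m_Index0 and m_Index1 calculations in RunMultipleThreads; the type IndexRange and
the function SplitIntoRanges are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct IndexRange       // inclusive range of array indices (hypothetical type)
{
    std::size_t i0;
    std::size_t i1;
};

inline std::vector<IndexRange> SplitIntoRanges(std::size_t num_elements,
    std::size_t block_size, std::size_t num_threads)
{
    assert(block_size != 0 && num_threads != 0);
    assert(num_elements % block_size == 0);

    const std::size_t num_blocks = num_elements / block_size;
    assert(num_blocks >= num_threads);      // every thread gets at least one block

    const std::size_t per_thread = num_blocks / num_threads;
    const std::size_t rem = num_blocks % num_threads;

    std::vector<IndexRange> ranges(num_threads);

    for (std::size_t k = 0; k < num_threads; k++)
    {
        ranges[k].i0 = k * per_thread * block_size;
        ranges[k].i1 = (k + 1) * per_thread * block_size - 1;

        if (k + 1 == num_threads)
            ranges[k].i1 += rem * block_size;   // last thread takes the leftover blocks
    }

    return ranges;
}
```

For example, 80 elements with a block size of 8 and three threads produce the ranges [0, 23], [24, 47], and
[48, 79]; the last thread processes the one block that does not divide evenly.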

Before reviewing the assembly language code, the C++ code in RunMultipleThreads merits a few
additional comments. The first thing to note is that the benchmarking code measures both the time it
takes to carry out the required calculations and the overhead that's associated with thread management.
If the algorithm that's employed by RunMultipleThreads were to be used in a real-world application, any
benchmark timing measurements would be meaningless without factoring in this overhead. It should also
be noted that RunMultipleThreads implements an extremely rudimentary form of multithreading that omits
many important real-world operations to simplify the code for this example. If you're interested in learning
more about the C++ STL thread class and the other STL classes that facilitate multithreaded processing, you
are strongly encouraged to consult the references listed in Appendix A.

The final file in Listing 16-4 is Ch16_04.asm. Near the top of this file is the assembly language
counterpart of the data structure CalcInfo. The assembly language function CalcResult is next.
Following its prolog, CalcResult uses the instructions mov rax,[rcx+CalcInfo.Index0] and mov
rdx,[rcx+CalcInfo.Index1] to load registers RAX and RDX with the indices of the first and last array
elements. It then calculates and validates num_elements. Next, the test arrays are validated for proper
alignment. The processing loop in CalcResult uses AVX packed double-precision floating-point arithmetic
to carry out the same simulated calculation as CalcResultCpp. The output for source code example Ch16_04
follows this paragraph. Note that this output was produced with c_ThreadMsgEnable set to false. Setting
this flag to true ultimately directs the function CalcResultThread to display status messages for each
executing thread. The flag c_ThreadMsgEnable is defined near the top of Ch16_04.cpp.


Begin initialization of test arrays 
Initializing test arrays x1, x2 
Initializing test arrays y1, y2 
Initializing test arrays z1, z2 
Finished initialization of test arrays 
Begin execution of CalcResultCpp 
Finished execution of CalcResultCpp 
Begin execution of calculating threads 
Finished execution of calculating threads 
Benchmark times save to file Ch16_04_MultipleThreads_BM_CHROMIUM.csv


Tables 16-4, 16-5, and 16-6 contain the benchmark timing measurements for source code example Ch16_04.
The measurements shown in these tables are the mean execution times from 10 separate runs and were made
with 32 GB of SDRAM installed in each test computer. All three test computers show a significant improvement
in performance when multiple threads are used to carry out the simulated calculation. For the i7-4790s and
i7-8700k test computers, optimal performance is attained using four threads. The i9-7900x test computer
shows meaningful performance gains when using six or eight threads. It is interesting to compare the timing
measurements for the i9-7900x and i7-8700k systems. When using one or two threads, the i7-8700k outperforms
the i9-7900x, while the opposite is true when four or more threads are employed. These measurements make
sense when considering the hardware differences between the test processors, which are shown in Table 16-7.
Even though the i7-8700k employs higher clock frequencies, the extra memory channels of the i9-7900x enable
it to better utilize its CPU cores and complete the required calculations more quickly.


Table 16-4. Benchmark Timing Measurements (Milliseconds) for RunMultipleThreads
Using an Intel i7-4790s Processor

                                 Number of Threads
Number of Elements (Millions)    1       2       4       6       8
64                               686     226     178     180     172
128                              1146    491     345     355     347
192                              1592    702     513     530     516
256                              2054    942     679     714     688

Table 16-5. Benchmark Timing Measurements (Milliseconds) for RunMultipleThreads
Using an Intel i9-7900x Processor

                                 Number of Threads
Number of Elements (Millions)    1       2       4       6       8
64                               492     137     84      69      61
128                              765     300     163     131     121
192                              1110    454     233     193     178
256                              1330    582     313     260     238


Table 16-6. Benchmark Timing Measurements (Milliseconds) for RunMultipleThreads
Using an Intel i7-8700k Processor

                                 Number of Threads
Number of Elements (Millions)    1       2       4       6       8
64                               332     125     120     123     123
128                              522     265     240     245     246
192                              839     387     363     366     369
256                              919     499     478     484     492


Table 16-7. Summary of Hardware Features for Test Processors Used in Example Ch16_04

Hardware Feature             i7-4790s     i9-7900x     i7-8700k
Number of cores              4            10           6
Number of threads            8            20           12
Base frequency (GHz)         3.2          3.3          3.7
Maximum frequency (GHz)      4.0          4.5          4.7
Memory type                  DDR3-1600    DDR4-2666    DDR4-2666
Number of memory channels    2            4            2

Summary


Here are the key learning points for Chapter 16:

• An application program should always use the cpuid instruction to verify processor
  support for specific instruction set extensions. This is extremely important for
  software compatibility with future processors from both AMD and Intel.

• An assembly language function can use the non-temporal store instructions
  vmovntp[d|s] instead of the vmovap[d|s] instructions to improve the performance
  of algorithms that carry out calculations using large arrays of non-temporal
  floating-point data.

• An assembly language function can use the prefetch[0|1|2] instructions to
  pre-load temporal data into the processor's cache hierarchy. A function can also use
  the prefetchnta instruction to pre-load non-temporal data and minimize cache
  pollution. The performance benefits of the prefetch instructions vary depending on
  data access patterns and the processor's underlying microarchitecture.

• A multithreaded algorithm that's implemented in a high-level language such as
  C++ can exploit AVX, AVX2, or AVX-512 assembly language calculating functions to
  accelerate an algorithm's overall performance.

Appendix A includes supplemental material about the following items:

• Software utilities for x86 processors
• Visual Studio
• References


Software Utilities for x86 Processors 


The following utilities can be used to determine which x86 instruction set extensions are supported by the 
processor in your computer: 


• CPUID CPU-Z (https://www.cpuid.com)
• HWiNFO Diagnostic Software (https://www.hwinfo.com)
• Piriform SPECCY (https://www.ccleaner.com/speccy)


Visual Studio 


In this section, you'll learn how to use Microsoft's Visual Studio development tool to run the source code 
examples that are described in the main text. You'll also learn how to create a simple Visual Studio C++ 
project. Before proceeding, you may want to refer to the Introduction for additional information regarding 
Visual Studio and the recommended hardware platforms for running the source code examples. The 
Introduction also contains important details about downloading the source code ZIP files for each chapter. 

Visual Studio uses logical entities called solutions and projects to help simplify application development. 
A solution is a collection of one or more projects that are used to build an application. A project is a container
object that organizes an application's files. A Visual Studio project is usually created for each buildable
component of an application (e.g., executable file, dynamic-link library, static library, and so on).

A standard Visual Studio C++ project includes two solution configurations named Debug and Release. 
As implied by their names, these configurations support separate executable builds for initial development 
and final release. A standard Visual Studio C++ project also incorporates solution platforms. The default 
solution platforms are named Win32 and x64, which contain the necessary settings to build 32-bit and 64-bit 
executables, respectively. The Visual Studio solution and project files for this book's source code examples 
include only the x64 platform. 


© Daniel Kusswurm 2018 585 
D. Kusswurm, Modern X86 Assembly Language Programming, 
https://doi.org/10.1007/978-1-4842-4063-2 


APPENDIX A


Running a Source Code Example 
You can use the following steps to run any of the book's source code examples: 


1. Using File Explorer, double-click on the chapter's Visual Studio solution (.sln)
   file. The solution file is included in the chapter source code ZIP file.

2. From the menu bar, select Build | Configuration Manager. In the Configuration
   Manager dialog box, set Active Solution Configuration to Release. Then set
   Active Solution Platform to x64. Note that these options may already be selected.

3. If necessary, select View | Solution Explorer to open the Solution Explorer
   window.

4. In the Solution Explorer window, right-click on a project to run and choose
   Set as StartUp Project.

5. Select Debug | Start Without Debugging to run the program.

Some of the source code examples reference data files in different folders using fixed path names.
To run the corresponding executables using a different folder structure than the one used for Visual Studio
development, you may need to change the path name strings in the C++ source code.


Creating a Visual Studio C++ Project 


In this section, you'll learn how to create a simple Visual Studio project that includes both C++ and assembly 
language source code files. The ensuing paragraphs describe the same basic procedure that was used to 
create the source code examples in the main text and includes the following phases: 


• Create a C++ project
• Enable MASM support
• Add an assembly language file
• Set project properties
• Edit the source code
• Build and run the project


Create a C++ Project 

Use the following steps to create a Visual Studio C++ project:

1. Start Visual Studio.

2. Select File | New Project.

3. In the New Project dialog box control tree, select Installed | Visual C++ |
   Windows Desktop.

4. Select Windows Console Application for the project type.

5. In the Name text box, enter Example1.

6. In the Location text box, enter a folder name for the project location. You can
   also use the Browse button to choose a folder or leave the text unchanged to use
   the default location.

7. In the Solution text box, enter TestSolution.

8. Verify that the New Project dialog box settings are the same as the ones shown in
   Figure A-1 (the Location can be different). Click OK.

9. If necessary, select View | Solution Explorer to open the Solution Explorer
   window.

10. In the Solution Explorer tree control, right-click on the top-level text that's
    labeled Solution 'Example1' (1 project) and select Rename. Change the
    solution name to TestSolution.

11. Select Build | Configuration Manager. In the Configuration Manager dialog box,
    choose <Edit...> under Active Solution Platforms (see Figure A-2).

12. In the Edit Solution Platforms dialog box, select x86 and click Remove
    (see Figure A-3). Click Close to close the Edit Solution Platforms dialog box;
    click Close to close the Configuration Manager dialog box.

Figure A-1. New Project dialog box

Figure A-2. Configuration Manager dialog box

Figure A-3. Edit Solution Platforms dialog box

Enable MASM Support

Use the following steps to enable support for Microsoft Macro Assembler:

1. In the Solution Explorer tree control, right-click on Example1 and select Build
   Dependencies | Build Customizations.

2. In the Visual C++ Build Customizations dialog box, check masm(.targets, .props).

3. Click OK.


Add an Assembly Language File

Use the following steps to add an assembly language source code file (.asm) to a Visual Studio C++ project:

1. In the Solution Explorer tree control, right-click on Example1 and select
   Add | New Item.

2. Select C++ File (.cpp) for the file type.

3. In the Name text box, change the name to Example1_.asm, as shown in
   Figure A-4. Note that the trailing underscore is required since all C++ and
   assembly language source code files in a project must have a unique base name.

4. Click Add.


Figure A-4. Add New Item dialog box

Set Project Properties

Use the following steps to set the project's properties. The properties that control listing file generation
(Steps 5-7) are optional.

1. In the Solution Explorer tree control, right-click on Example1 and select
   Properties.

2. In the Property Pages dialog box, change the Configuration setting to All
   Configurations and the Platform setting to All Platforms. Note that one or both
   options may already be set.

3. In the tree control, select Configuration Properties | General. Change the
   setting Whole Program Optimization to No Whole Program Optimization
   (see Figure A-5).

4. Select Configuration Properties | C/C++ | Code Generation. Change the setting
   Enable Enhanced Instruction Set to Advanced Vector Extensions (/arch:AVX)
   (see Figure A-6).

5. Select Configuration Properties | C/C++ | Output Files. Change the setting Assembler
   Output to Assembly Machine and Source Code (/FAcs) (see Figure A-7).

6. Select Configuration Properties | Microsoft Macro Assembler | Listing File.
   Change the setting Enable Assembly Generated Code Listing to Yes (/Sg)
   (see Figure A-8).

7. Change the Assembled Code Listing File text field to $(IntDir)\%(filename).lst
   (see Figure A-8). This macro text specifies the project's intermediate directory,
   which is a subfolder of the main project folder.

8. Click OK.


Figure A-5. Property Pages dialog box (Whole Program Optimization)

Figure A-6. Property Pages dialog box (Enable Enhanced Instruction Set)

Figure A-7. Property Pages dialog box (Assembler Output)

Figure A-8. Property Pages dialog box (Microsoft Macro Assembler Listing File)


Edit the Source Code 


Use the following steps to edit the project source code:

1. In the Editor window, click on the tab named Example1.cpp.

2. Edit the C++ source code to match the code that's shown in Listing A-1.

3. Click on the tab named Example1_.asm.

4. Edit the assembly language source code to match the code that's shown in
   Listing A-2.

5. Select File | Save All.


Listing A-1. Example1.cpp

// Example1.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>

using namespace std;

extern "C" int CalcResult1_(int val1, int val2, int* quo, int* rem);

int main()
{
    int val1 = 42;
    int val2 = 9;
    int quo, rem;
    int prod = CalcResult1_(val1, val2, &quo, &rem);

    cout << "Results for Example1\n";
    cout << "val1 = " << val1 << '\n';
    cout << "val2 = " << val2 << '\n';
    cout << "quo = " << quo << '\n';
    cout << "rem = " << rem << '\n';
    cout << "prod = " << prod << '\n';
    return 0;
}


Listing A-2. Example1_.asm

; extern "C" int CalcResult1_(int val1, int val2, int* quo, int* rem);

        .code
CalcResult1_ proc
        mov r10d,ecx                    ;r10d = val1
        mov r11d,edx                    ;r11d = val2

        mov eax,ecx                     ;eax = val1
        cdq                             ;edx:eax = val1
        idiv r11d                       ;calc val1 / val2
        mov dword ptr [r8],eax          ;save quotient
        mov dword ptr [r9],edx          ;save remainder

        imul r10d,r11d                  ;r10d = val1 * val2
        mov eax,r10d                    ;eax = val1 * val2
        ret
CalcResult1_ endp
        end


Build and Run the Project

Use the following steps to build and run the project:

1. Select Build | Build Solution.

2. If necessary, fix any reported C++ compiler or MASM errors and repeat Step 1.

3. Select Debug | Start Without Debugging.

4. Verify that the output matches the console window shown in Figure A-9.

5. Press Enter to close the console window.




prod = 378 
Press any key to continue . . . 


Figure A-9. Console window output 
473, 502
AlignedMem, 238
array, 343
unique_ptr, 302, 397, 473
rvalue, 355
cache line, 531, 539
L1 data (D-Cache), 530-531 
L1 instruction (I-Cache), 530-532 
L2, 530-532 
L3, 530-532 
non-temporal data, 557, 562 
pollution, 557, 562 
slice, 531 
temporal data, 557, 562 
variable size, 388
response signal, 385-386
serializing instruction, 554
Data blend, 285, 333
Data gather
Enhanced bit manipulation
FMA. See Fused-Multiply-Add (FMA)
FMA3. See Fused-Multiply-Add (FMA)
FMA4. See Fused-Multiply-Add (FMA)
doubleword, 3-4


little endian ordering, 3 
proper alignment, 4 
Fused-Multiply-Add (FMA)
arithmetic, 281 
convolution functions 
packed, 398-406 
scalar, 388-398 
data dependencies 
multiple registers, 397 
operand ordering scheme, 282 
rounding
MXCSR.RC, 281
scalar, 281
value discrepancies, 406
sign bit, 280
significand, 280
F16C, 280
rounding mode, 418


IEEE 754 
exponent, 95-96
sign bit, 95 
significand, 95-96 
special values 
denormal, 96 
floating-point zero, 96 
infinity, 96 
NaN, 96 
QNaN, 96 
SNaN, 96 
Image processing 
mean, 510, 517, 519
image thresholding
F16C, 280, 385
LZCNT, 280, 282-283, 406
POPCNT, 280, 282-283 

Integer arithmetic 
addition, 22-24 
mixed sizes, 35-40
Jump table, 182
Linked list
node 
data, 569 
end-of-list terminator, 570 
link, 569 
Loop unrolling, 239 


INDEX 


MASM. See Microsoft Macro Assembler (MASM) 
Matrix operations 

inverse 

Cayley-Hamilton theorem, 329 

multiplication, 207-213, 312-320 

transposition, 199-204, 206-207, 312-320 
Matrix-vector multiplication 

equations, 476, 482 

permutation of vector components, 482 
Memory addressing modes 

base register, 12, 41 

base register + disp, 12 

base register + index register, 12, 42 

base register + index register + disp, 12 


base register + index register * scale factor, 12, 42 


base register + index register * scale factor + 
disp, 12, 42 
effective address calculation, 11 
index * scale factor + disp, 12 
RIP + disp (RIP relative), 12 
RIP relative, 42 
Microarchitecture 
Coffee Lake, 529 
Haswell, 529, 534 
Kaby Lake, 529 
Skylake, 529-531, 533 
Skylake Server, 421, 433, 529, 534 
Micro-op 
macro-fusion, 532 
micro-fusion, 532 
Microsoft Macro Assembler (MASM) 
comment line, 23 
custom segment, 290, 330 
directive 
=, 147 
align, 171, 262, 290, 330, 450 
.allocstack, 146, 151 
bcst, 463 
byte ptr, 38 
catstr, 375 
.code, 23 
.const, 42 
.data, 43 
dup, 290 
dword, 26, 42, 290 
dword ptr, 38 
endp, 23 
.endprolog, 56, 146-147, 151 
ends, 330 
equ, 127, 147 
.erridni, 375 
macro, 203 
proc, 23 
proc frame, 56, 164
.pushreg, 56, 151 
qword, 262, 290 
qword ptr, 38 
readonly, 330 
real4, 111 
real8, 114 
.savexmm128, 157 
segment, 330 
.setframe, 146 
substr, 375 
word ptr, 38 
xmmword ptr, 172 
ymmword ptr, 290, 351 
zmmword ptr, 450 
label, 47 
location counter ($), 42 
macro text string, 375 


Miscellaneous data types 


bit field, 6 
bit string, 6 
string, 6 


Multithreading 


data arrays, 570 


MXCSR 


control flags, 97, 98, 128, 133-134 
rounding control, 98, 100 
rounding mode, 128, 133-134 
status flags, 92, 97, 100, 122 


N 


Non-temporal memory store 


arrays, 557, 561 
hint, 562 


Numeric data types 


floating-point 
double-precision, 5 
single-precision, 5 

signed integers, 5 

unsigned integers, 5 


O 


Optimization 


basic techniques, 534-536 
data alignment 
multi-byte values, 538 
packed floating-point, 538 
packed integer, 538 
floating-point arithmetic 
denormals, 536 
loop unrolling, 536 
precision, 536 
register dependencies, 536 


program branches 
backward conditional, 537 
branch prediction, 537 
forward conditional, 537 
loop unrolling, 537 

SIMD techniques 
register spills, 539 


PQ 


Packed floating-point arithmetic 
common operations 
addition, 170 
compares, 173-179 
conversions, 179-183 
division, 170-171 
multiplication, 170-171 
subtraction, 170 
compares, 452-457 
conversions 
unsigned integer, 440, 443 
logical decisions, 295, 346 
operations 
absolute value, 290-291 
addition, 288-289, 448-449 
division, 289, 448-449 
multiplication, 289, 448-449 
square root, 289-290, 448-449 
subtraction, 288-289, 448-449 
Packed integer arithmetic 
basic arithmetic 
doubleword, 502-503, 517-519, 
526-527 
word, 495 
common operations 
addition, 215-221 
multiplication, 226-232 
shifts, 221-226 
subtraction, 215-221 
operations 
addition, 350 
shifts, 350 
subtraction, 350 
pack and unpack, 352-357 
size promotions, 244, 253 
sign extended, 358, 362 
zero extended, 358, 362 


R 


Registers 
general purpose 
8-bit, 8 
16-bit, 8 
32-bit, 8 




64-bit, 8-9 
MXCSR, 6-7 
RFLAGS 
carry, 9-10 
direction, 9-10 
overflow, 9-10 
parity, 9-10 
sign, 9-10 
zero, 9-10 
RIP (instruction pointer), 10 
RSP (stack pointer), 9 
XMM, 6-7, 9 
YMM, 6-7 
ZMM, 6 


RFLAGS. See Registers 
Ring interconnect, 531 


S, T, U 


Scalar floating-point arithmetic 


arrays, 135 
double-precision, 109, 112 
matrices, 138-143 
operations 
addition, 117 
compares, 118 
conversions, 128 
division, 109 
multiplication, 117 
square root, 117 
subtraction, 117 
single-precision, 110-112 


SIMD. See Single Instruction 


Multiple Data (SIMD) 


Single Instruction Multiple Data (SIMD) 


arithmetic 
horizontal addition, 101 
horizontal subtraction, 101 
packed floating-point, 100-102 
packed integer, 103-105 
saturated, 90-91 
wraparound, 90-91
data types 
xmmword, 6 
ymmword, 6 
zmmword, 6 
programming concepts, 88-89 


Smoothing operator 


Gaussian filter 
coefficients, 386 


Strings 


concatenation, 74-79 
counting characters, 71-73 
direction flag, 82, 85 
end-of-string character, 73 


Structure 

member alignment, 68 

padding, 68 
Structure of arrays (SOA), 473, 476, 526-527 
System agent, 531 


VW 


Vector cross product 
component equation, 473 
gather, 466, 473 
opmask register, 473-474 
scatter, 466, 473 
Vector scale-index-base (VSIB). See AVX2 
Visual C++ 
calling convention 
epilog macros, 144, 159-166 
floating-point argument, 141 


floating-point return value, 112, 153, 165 


function epilog, 56, 152-153 
function prolog, 56, 159 
general-purpose register, 
143, 148-153 
integer argument, 146 
leaf function, 143, 147 
local storage, 147, 159, 262 
non-leaf function, 143-144, 146, 159, 
163-164 
non-volatile register, 56, 143, 146-147, 151,
157, 163, 165
prolog macros, 144, 159-166 
register arguments, 23, 38-39 
returning structures by value, 356 
return value, 33, 70 
stack alignment, 146 
stack arguments, 23, 38-39 
stack frame, 143-148, 151, 157, 163-165 
stack layout, 146-147, 152, 158-159, 163-164 
volatile register, 56, 143 
XMM register, 143, 151, 153, 164-165 
ZMM registers, 436, 449 
decorated name, 26 
extern "C" modifier, 26 


X 


XmmVal, 171-172, 177, 182, 219-220, 225, 
230, 232, 290 


Y 


YmmVal, 290, 337, 351, 355 


Z 


ZmmVal, 449-450, 455-456, 495


